1*22dc650dSSadaf Ebrahimi<html> 2*22dc650dSSadaf Ebrahimi<head> 3*22dc650dSSadaf Ebrahimi<title>pcre2perform specification</title> 4*22dc650dSSadaf Ebrahimi</head> 5*22dc650dSSadaf Ebrahimi<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB"> 6*22dc650dSSadaf Ebrahimi<h1>pcre2perform man page</h1> 7*22dc650dSSadaf Ebrahimi<p> 8*22dc650dSSadaf EbrahimiReturn to the <a href="index.html">PCRE2 index page</a>. 9*22dc650dSSadaf Ebrahimi</p> 10*22dc650dSSadaf Ebrahimi<p> 11*22dc650dSSadaf EbrahimiThis page is part of the PCRE2 HTML documentation. It was generated 12*22dc650dSSadaf Ebrahimiautomatically from the original man page. If there is any nonsense in it, 13*22dc650dSSadaf Ebrahimiplease consult the man page, in case the conversion went wrong. 14*22dc650dSSadaf Ebrahimi<br> 15*22dc650dSSadaf Ebrahimi<ul> 16*22dc650dSSadaf Ebrahimi<li><a name="TOC1" href="#SEC1">PCRE2 PERFORMANCE</a> 17*22dc650dSSadaf Ebrahimi<li><a name="TOC2" href="#SEC2">COMPILED PATTERN MEMORY USAGE</a> 18*22dc650dSSadaf Ebrahimi<li><a name="TOC3" href="#SEC3">STACK AND HEAP USAGE AT RUN TIME</a> 19*22dc650dSSadaf Ebrahimi<li><a name="TOC4" href="#SEC4">PROCESSING TIME</a> 20*22dc650dSSadaf Ebrahimi<li><a name="TOC5" href="#SEC5">AUTHOR</a> 21*22dc650dSSadaf Ebrahimi<li><a name="TOC6" href="#SEC6">REVISION</a> 22*22dc650dSSadaf Ebrahimi</ul> 23*22dc650dSSadaf Ebrahimi<br><a name="SEC1" href="#TOC1">PCRE2 PERFORMANCE</a><br> 24*22dc650dSSadaf Ebrahimi<P> 25*22dc650dSSadaf EbrahimiTwo aspects of performance are discussed below: memory usage and processing 26*22dc650dSSadaf Ebrahimitime. The way you express your pattern as a regular expression can affect both 27*22dc650dSSadaf Ebrahimiof them. 28*22dc650dSSadaf Ebrahimi</P> 29*22dc650dSSadaf Ebrahimi<br><a name="SEC2" href="#TOC1">COMPILED PATTERN MEMORY USAGE</a><br> 30*22dc650dSSadaf Ebrahimi<P> 31*22dc650dSSadaf EbrahimiPatterns are compiled by PCRE2 into a reasonably efficient interpretive code, 32*22dc650dSSadaf Ebrahimiso that most simple patterns do not use much memory for storing the compiled 33*22dc650dSSadaf Ebrahimiversion. However, there is one case where the memory usage of a compiled 34*22dc650dSSadaf Ebrahimipattern can be unexpectedly large. If a parenthesized group has a quantifier 35*22dc650dSSadaf Ebrahimiwith a minimum greater than 1 and/or a limited maximum, the whole group is 36*22dc650dSSadaf Ebrahimirepeated in the compiled code. For example, the pattern 37*22dc650dSSadaf Ebrahimi<pre> 38*22dc650dSSadaf Ebrahimi (abc|def){2,4} 39*22dc650dSSadaf Ebrahimi</pre> 40*22dc650dSSadaf Ebrahimiis compiled as if it were 41*22dc650dSSadaf Ebrahimi<pre> 42*22dc650dSSadaf Ebrahimi (abc|def)(abc|def)((abc|def)(abc|def)?)? 43*22dc650dSSadaf Ebrahimi</pre> 44*22dc650dSSadaf Ebrahimi(Technical aside: It is done this way so that backtrack points within each of 45*22dc650dSSadaf Ebrahimithe repetitions can be independently maintained.) 46*22dc650dSSadaf Ebrahimi</P> 47*22dc650dSSadaf Ebrahimi<P> 48*22dc650dSSadaf EbrahimiFor regular expressions whose quantifiers use only small numbers, this is not 49*22dc650dSSadaf Ebrahimiusually a problem. However, if the numbers are large, and particularly if such 50*22dc650dSSadaf Ebrahimirepetitions are nested, the memory usage can become an embarrassment. For 51*22dc650dSSadaf Ebrahimiexample, the very simple pattern 52*22dc650dSSadaf Ebrahimi<pre> 53*22dc650dSSadaf Ebrahimi ((ab){1,1000}c){1,3} 54*22dc650dSSadaf Ebrahimi</pre> 55*22dc650dSSadaf Ebrahimiuses over 50KiB when compiled using the 8-bit library. When PCRE2 is 56*22dc650dSSadaf Ebrahimicompiled with its default internal pointer size of two bytes, the size limit on 57*22dc650dSSadaf Ebrahimia compiled pattern is 65535 code units in the 8-bit and 16-bit libraries, and 58*22dc650dSSadaf Ebrahimithis is reached with the above pattern if the outer repetition is increased 59*22dc650dSSadaf Ebrahimifrom 3 to 4. PCRE2 can be compiled to use larger internal pointers and thus 60*22dc650dSSadaf Ebrahimihandle larger compiled patterns, but it is better to try to rewrite your 61*22dc650dSSadaf Ebrahimipattern to use less memory if you can. 62*22dc650dSSadaf Ebrahimi</P> 63*22dc650dSSadaf Ebrahimi<P> 64*22dc650dSSadaf EbrahimiOne way of reducing the memory usage for such patterns is to make use of 65*22dc650dSSadaf EbrahimiPCRE2's 66*22dc650dSSadaf Ebrahimi<a href="pcre2pattern.html#subpatternsassubroutines">"subroutine"</a> 67*22dc650dSSadaf Ebrahimifacility. Re-writing the above pattern as 68*22dc650dSSadaf Ebrahimi<pre> 69*22dc650dSSadaf Ebrahimi ((ab)(?2){0,999}c)(?1){0,2} 70*22dc650dSSadaf Ebrahimi</pre> 71*22dc650dSSadaf Ebrahimireduces the memory requirements to around 16KiB, and indeed it remains under 72*22dc650dSSadaf Ebrahimi20KiB even with the outer repetition increased to 100. However, this kind of 73*22dc650dSSadaf Ebrahimipattern is not always exactly equivalent, because any captures within 74*22dc650dSSadaf Ebrahimisubroutine calls are lost when the subroutine completes. If this is not a 75*22dc650dSSadaf Ebrahimiproblem, this kind of rewriting will allow you to process patterns that PCRE2 76*22dc650dSSadaf Ebrahimicannot otherwise handle. The matching performance of the two different versions 77*22dc650dSSadaf Ebrahimiof the pattern are roughly the same. (This applies from release 10.30 - things 78*22dc650dSSadaf Ebrahimiwere different in earlier releases.) 79*22dc650dSSadaf Ebrahimi</P> 80*22dc650dSSadaf Ebrahimi<br><a name="SEC3" href="#TOC1">STACK AND HEAP USAGE AT RUN TIME</a><br> 81*22dc650dSSadaf Ebrahimi<P> 82*22dc650dSSadaf EbrahimiFrom release 10.30, the interpretive (non-JIT) version of <b>pcre2_match()</b> 83*22dc650dSSadaf Ebrahimiuses very little system stack at run time. In earlier releases recursive 84*22dc650dSSadaf Ebrahimifunction calls could use a great deal of stack, and this could cause problems, 85*22dc650dSSadaf Ebrahimibut this usage has been eliminated. Backtracking positions are now explicitly 86*22dc650dSSadaf Ebrahimiremembered in memory frames controlled by the code. 87*22dc650dSSadaf Ebrahimi</P> 88*22dc650dSSadaf Ebrahimi<P> 89*22dc650dSSadaf EbrahimiThe size of each frame depends on the size of pointer variables and the number 90*22dc650dSSadaf Ebrahimiof capturing parenthesized groups in the pattern being matched. On a 64-bit 91*22dc650dSSadaf Ebrahimisystem the frame size for a pattern with no captures is 128 bytes. For each 92*22dc650dSSadaf Ebrahimicapturing group the size increases by 16 bytes. 93*22dc650dSSadaf Ebrahimi</P> 94*22dc650dSSadaf Ebrahimi<P> 95*22dc650dSSadaf EbrahimiUntil release 10.41, an initial 20KiB frames vector was allocated on the system 96*22dc650dSSadaf Ebrahimistack, but this still caused some issues for multi-thread applications where 97*22dc650dSSadaf Ebrahimieach thread has a very small stack. From release 10.41 backtracking memory 98*22dc650dSSadaf Ebrahimiframes are always held in heap memory. An initial heap allocation is obtained 99*22dc650dSSadaf Ebrahimithe first time any match data block is passed to <b>pcre2_match()</b>. This is 100*22dc650dSSadaf Ebrahimiremembered with the match data block and re-used if that block is used for 101*22dc650dSSadaf Ebrahimianother match. It is freed when the match data block itself is freed. 102*22dc650dSSadaf Ebrahimi</P> 103*22dc650dSSadaf Ebrahimi<P> 104*22dc650dSSadaf EbrahimiThe size of the initial block is the larger of 20KiB or ten times the pattern's 105*22dc650dSSadaf Ebrahimiframe size, unless the heap limit is less than this, in which case the heap 106*22dc650dSSadaf Ebrahimilimit is used. If the initial block proves to be too small during matching, it 107*22dc650dSSadaf Ebrahimiis replaced by a larger block, subject to the heap limit. The heap limit is 108*22dc650dSSadaf Ebrahimichecked only when a new block is to be allocated. Reducing the heap limit 109*22dc650dSSadaf Ebrahimibetween calls to <b>pcre2_match()</b> with the same match data block does not 110*22dc650dSSadaf Ebrahimiaffect the saved block. 111*22dc650dSSadaf Ebrahimi</P> 112*22dc650dSSadaf Ebrahimi<P> 113*22dc650dSSadaf EbrahimiIn contrast to <b>pcre2_match()</b>, <b>pcre2_dfa_match()</b> does use recursive 114*22dc650dSSadaf Ebrahimifunction calls, but only for processing atomic groups, lookaround assertions, 115*22dc650dSSadaf Ebrahimiand recursion within the pattern. The original version of the code used to 116*22dc650dSSadaf Ebrahimiallocate quite large internal workspace vectors on the stack, which caused some 117*22dc650dSSadaf Ebrahimiproblems for some patterns in environments with small stacks. From release 118*22dc650dSSadaf Ebrahimi10.32 the code for <b>pcre2_dfa_match()</b> has been re-factored to use heap 119*22dc650dSSadaf Ebrahimimemory when necessary for internal workspace when recursing, though recursive 120*22dc650dSSadaf Ebrahimifunction calls are still used. 121*22dc650dSSadaf Ebrahimi</P> 122*22dc650dSSadaf Ebrahimi<P> 123*22dc650dSSadaf EbrahimiThe "match depth" parameter can be used to limit the depth of function 124*22dc650dSSadaf Ebrahimirecursion, and the "match heap" parameter to limit heap memory in 125*22dc650dSSadaf Ebrahimi<b>pcre2_dfa_match()</b>. 126*22dc650dSSadaf Ebrahimi</P> 127*22dc650dSSadaf Ebrahimi<br><a name="SEC4" href="#TOC1">PROCESSING TIME</a><br> 128*22dc650dSSadaf Ebrahimi<P> 129*22dc650dSSadaf EbrahimiCertain items in regular expression patterns are processed more efficiently 130*22dc650dSSadaf Ebrahimithan others. It is more efficient to use a character class like [aeiou] than a 131*22dc650dSSadaf Ebrahimiset of single-character alternatives such as (a|e|i|o|u). In general, the 132*22dc650dSSadaf Ebrahimisimplest construction that provides the required behaviour is usually the most 133*22dc650dSSadaf Ebrahimiefficient. Jeffrey Friedl's book contains a lot of useful general discussion 134*22dc650dSSadaf Ebrahimiabout optimizing regular expressions for efficient performance. This document 135*22dc650dSSadaf Ebrahimicontains a few observations about PCRE2. 136*22dc650dSSadaf Ebrahimi</P> 137*22dc650dSSadaf Ebrahimi<P> 138*22dc650dSSadaf EbrahimiUsing Unicode character properties (the \p, \P, and \X escapes) is slow, 139*22dc650dSSadaf Ebrahimibecause PCRE2 has to use a multi-stage table lookup whenever it needs a 140*22dc650dSSadaf Ebrahimicharacter's property. If you can find an alternative pattern that does not use 141*22dc650dSSadaf Ebrahimicharacter properties, it will probably be faster. 142*22dc650dSSadaf Ebrahimi</P> 143*22dc650dSSadaf Ebrahimi<P> 144*22dc650dSSadaf EbrahimiBy default, the escape sequences \b, \d, \s, and \w, and the POSIX 145*22dc650dSSadaf Ebrahimicharacter classes such as [:alpha:] do not use Unicode properties, partly for 146*22dc650dSSadaf Ebrahimibackwards compatibility, and partly for performance reasons. However, you can 147*22dc650dSSadaf Ebrahimiset the PCRE2_UCP option or start the pattern with (*UCP) if you want Unicode 148*22dc650dSSadaf Ebrahimicharacter properties to be used. This can double the matching time for items 149*22dc650dSSadaf Ebrahimisuch as \d, when matched with <b>pcre2_match()</b>; the performance loss is 150*22dc650dSSadaf Ebrahimiless with a DFA matching function, and in both cases there is not much 151*22dc650dSSadaf Ebrahimidifference for \b. 152*22dc650dSSadaf Ebrahimi</P> 153*22dc650dSSadaf Ebrahimi<P> 154*22dc650dSSadaf EbrahimiWhen a pattern begins with .* not in atomic parentheses, nor in parentheses 155*22dc650dSSadaf Ebrahimithat are the subject of a backreference, and the PCRE2_DOTALL option is set, 156*22dc650dSSadaf Ebrahimithe pattern is implicitly anchored by PCRE2, since it can match only at the 157*22dc650dSSadaf Ebrahimistart of a subject string. If the pattern has multiple top-level branches, they 158*22dc650dSSadaf Ebrahimimust all be anchorable. The optimization can be disabled by the 159*22dc650dSSadaf EbrahimiPCRE2_NO_DOTSTAR_ANCHOR option, and is automatically disabled if the pattern 160*22dc650dSSadaf Ebrahimicontains (*PRUNE) or (*SKIP). 161*22dc650dSSadaf Ebrahimi</P> 162*22dc650dSSadaf Ebrahimi<P> 163*22dc650dSSadaf EbrahimiIf PCRE2_DOTALL is not set, PCRE2 cannot make this optimization, because the 164*22dc650dSSadaf Ebrahimidot metacharacter does not then match a newline, and if the subject string 165*22dc650dSSadaf Ebrahimicontains newlines, the pattern may match from the character immediately 166*22dc650dSSadaf Ebrahimifollowing one of them instead of from the very start. For example, the pattern 167*22dc650dSSadaf Ebrahimi<pre> 168*22dc650dSSadaf Ebrahimi .*second 169*22dc650dSSadaf Ebrahimi</pre> 170*22dc650dSSadaf Ebrahimimatches the subject "first\nand second" (where \n stands for a newline 171*22dc650dSSadaf Ebrahimicharacter), with the match starting at the seventh character. In order to do 172*22dc650dSSadaf Ebrahimithis, PCRE2 has to retry the match starting after every newline in the subject. 173*22dc650dSSadaf Ebrahimi</P> 174*22dc650dSSadaf Ebrahimi<P> 175*22dc650dSSadaf EbrahimiIf you are using such a pattern with subject strings that do not contain 176*22dc650dSSadaf Ebrahiminewlines, the best performance is obtained by setting PCRE2_DOTALL, or starting 177*22dc650dSSadaf Ebrahimithe pattern with ^.* or ^.*? to indicate explicit anchoring. That saves PCRE2 178*22dc650dSSadaf Ebrahimifrom having to scan along the subject looking for a newline to restart at. 179*22dc650dSSadaf Ebrahimi</P> 180*22dc650dSSadaf Ebrahimi<P> 181*22dc650dSSadaf EbrahimiBeware of patterns that contain nested indefinite repeats. These can take a 182*22dc650dSSadaf Ebrahimilong time to run when applied to a string that does not match. Consider the 183*22dc650dSSadaf Ebrahimipattern fragment 184*22dc650dSSadaf Ebrahimi<pre> 185*22dc650dSSadaf Ebrahimi ^(a+)* 186*22dc650dSSadaf Ebrahimi</pre> 187*22dc650dSSadaf EbrahimiThis can match "aaaa" in 16 different ways, and this number increases very 188*22dc650dSSadaf Ebrahimirapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4 189*22dc650dSSadaf Ebrahimitimes, and for each of those cases other than 0 or 4, the + repeats can match 190*22dc650dSSadaf Ebrahimidifferent numbers of times.) When the remainder of the pattern is such that the 191*22dc650dSSadaf Ebrahimientire match is going to fail, PCRE2 has in principle to try every possible 192*22dc650dSSadaf Ebrahimivariation, and this can take an extremely long time, even for relatively short 193*22dc650dSSadaf Ebrahimistrings. 194*22dc650dSSadaf Ebrahimi</P> 195*22dc650dSSadaf Ebrahimi<P> 196*22dc650dSSadaf EbrahimiAn optimization catches some of the more simple cases such as 197*22dc650dSSadaf Ebrahimi<pre> 198*22dc650dSSadaf Ebrahimi (a+)*b 199*22dc650dSSadaf Ebrahimi</pre> 200*22dc650dSSadaf Ebrahimiwhere a literal character follows. Before embarking on the standard matching 201*22dc650dSSadaf Ebrahimiprocedure, PCRE2 checks that there is a "b" later in the subject string, and if 202*22dc650dSSadaf Ebrahimithere is not, it fails the match immediately. However, when there is no 203*22dc650dSSadaf Ebrahimifollowing literal this optimization cannot be used. You can see the difference 204*22dc650dSSadaf Ebrahimiby comparing the behaviour of 205*22dc650dSSadaf Ebrahimi<pre> 206*22dc650dSSadaf Ebrahimi (a+)*\d 207*22dc650dSSadaf Ebrahimi</pre> 208*22dc650dSSadaf Ebrahimiwith the pattern above. The former gives a failure almost instantly when 209*22dc650dSSadaf Ebrahimiapplied to a whole line of "a" characters, whereas the latter takes an 210*22dc650dSSadaf Ebrahimiappreciable time with strings longer than about 20 characters. 211*22dc650dSSadaf Ebrahimi</P> 212*22dc650dSSadaf Ebrahimi<P> 213*22dc650dSSadaf EbrahimiIn many cases, the solution to this kind of performance issue is to use an 214*22dc650dSSadaf Ebrahimiatomic group or a possessive quantifier. This can often reduce memory 215*22dc650dSSadaf Ebrahimirequirements as well. As another example, consider this pattern: 216*22dc650dSSadaf Ebrahimi<pre> 217*22dc650dSSadaf Ebrahimi ([^<]|<(?!inet))+ 218*22dc650dSSadaf Ebrahimi</pre> 219*22dc650dSSadaf EbrahimiIt matches from wherever it starts until it encounters "<inet" or the end of 220*22dc650dSSadaf Ebrahimithe data, and is the kind of pattern that might be used when processing an XML 221*22dc650dSSadaf Ebrahimifile. Each iteration of the outer parentheses matches either one character that 222*22dc650dSSadaf Ebrahimiis not "<" or a "<" that is not followed by "inet". However, each time a 223*22dc650dSSadaf Ebrahimiparenthesis is processed, a backtracking position is passed, so this 224*22dc650dSSadaf Ebrahimiformulation uses a memory frame for each matched character. For a long string, 225*22dc650dSSadaf Ebrahimia lot of memory is required. Consider now this rewritten pattern, which matches 226*22dc650dSSadaf Ebrahimiexactly the same strings: 227*22dc650dSSadaf Ebrahimi<pre> 228*22dc650dSSadaf Ebrahimi ([^<]++|<(?!inet))+ 229*22dc650dSSadaf Ebrahimi</pre> 230*22dc650dSSadaf EbrahimiThis runs much faster, because sequences of characters that do not contain "<" 231*22dc650dSSadaf Ebrahimiare "swallowed" in one item inside the parentheses, and a possessive quantifier 232*22dc650dSSadaf Ebrahimiis used to stop any backtracking into the runs of non-"<" characters. This 233*22dc650dSSadaf Ebrahimiversion also uses a lot less memory because entry to a new set of parentheses 234*22dc650dSSadaf Ebrahimihappens only when a "<" character that is not followed by "inet" is encountered 235*22dc650dSSadaf Ebrahimi(and we assume this is relatively rare). 236*22dc650dSSadaf Ebrahimi</P> 237*22dc650dSSadaf Ebrahimi<P> 238*22dc650dSSadaf EbrahimiThis example shows that one way of optimizing performance when matching long 239*22dc650dSSadaf Ebrahimisubject strings is to write repeated parenthesized subpatterns to match more 240*22dc650dSSadaf Ebrahimithan one character whenever possible. 241*22dc650dSSadaf Ebrahimi</P> 242*22dc650dSSadaf Ebrahimi<br><b> 243*22dc650dSSadaf EbrahimiSETTING RESOURCE LIMITS 244*22dc650dSSadaf Ebrahimi</b><br> 245*22dc650dSSadaf Ebrahimi<P> 246*22dc650dSSadaf EbrahimiYou can set limits on the amount of processing that takes place when matching, 247*22dc650dSSadaf Ebrahimiand on the amount of heap memory that is used. The default values of the limits 248*22dc650dSSadaf Ebrahimiare very large, and unlikely ever to operate. They can be changed when PCRE2 is 249*22dc650dSSadaf Ebrahimibuilt, and they can also be set when <b>pcre2_match()</b> or 250*22dc650dSSadaf Ebrahimi<b>pcre2_dfa_match()</b> is called. For details of these interfaces, see the 251*22dc650dSSadaf Ebrahimi<a href="pcre2build.html"><b>pcre2build</b></a> 252*22dc650dSSadaf Ebrahimidocumentation and the section entitled 253*22dc650dSSadaf Ebrahimi<a href="pcre2api.html#matchcontext">"The match context"</a> 254*22dc650dSSadaf Ebrahimiin the 255*22dc650dSSadaf Ebrahimi<a href="pcre2api.html"><b>pcre2api</b></a> 256*22dc650dSSadaf Ebrahimidocumentation. 257*22dc650dSSadaf Ebrahimi</P> 258*22dc650dSSadaf Ebrahimi<P> 259*22dc650dSSadaf EbrahimiThe <b>pcre2test</b> test program has a modifier called "find_limits" which, if 260*22dc650dSSadaf Ebrahimiapplied to a subject line, causes it to find the smallest limits that allow a 261*22dc650dSSadaf Ebrahimipattern to match. This is done by repeatedly matching with different limits. 262*22dc650dSSadaf Ebrahimi</P> 263*22dc650dSSadaf Ebrahimi<br><a name="SEC5" href="#TOC1">AUTHOR</a><br> 264*22dc650dSSadaf Ebrahimi<P> 265*22dc650dSSadaf EbrahimiPhilip Hazel 266*22dc650dSSadaf Ebrahimi<br> 267*22dc650dSSadaf EbrahimiRetired from University Computing Service 268*22dc650dSSadaf Ebrahimi<br> 269*22dc650dSSadaf EbrahimiCambridge, England. 270*22dc650dSSadaf Ebrahimi<br> 271*22dc650dSSadaf Ebrahimi</P> 272*22dc650dSSadaf Ebrahimi<br><a name="SEC6" href="#TOC1">REVISION</a><br> 273*22dc650dSSadaf Ebrahimi<P> 274*22dc650dSSadaf EbrahimiLast updated: 27 July 2022 275*22dc650dSSadaf Ebrahimi<br> 276*22dc650dSSadaf EbrahimiCopyright © 1997-2022 University of Cambridge. 277*22dc650dSSadaf Ebrahimi<br> 278*22dc650dSSadaf Ebrahimi<p> 279*22dc650dSSadaf EbrahimiReturn to the <a href="index.html">PCRE2 index page</a>. 280*22dc650dSSadaf Ebrahimi</p> 281