1*22dc650dSSadaf Ebrahimi<html> 2*22dc650dSSadaf Ebrahimi<head> 3*22dc650dSSadaf Ebrahimi<title>pcre2matching specification</title> 4*22dc650dSSadaf Ebrahimi</head> 5*22dc650dSSadaf Ebrahimi<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB"> 6*22dc650dSSadaf Ebrahimi<h1>pcre2matching man page</h1> 7*22dc650dSSadaf Ebrahimi<p> 8*22dc650dSSadaf EbrahimiReturn to the <a href="index.html">PCRE2 index page</a>. 9*22dc650dSSadaf Ebrahimi</p> 10*22dc650dSSadaf Ebrahimi<p> 11*22dc650dSSadaf EbrahimiThis page is part of the PCRE2 HTML documentation. It was generated 12*22dc650dSSadaf Ebrahimiautomatically from the original man page. If there is any nonsense in it, 13*22dc650dSSadaf Ebrahimiplease consult the man page, in case the conversion went wrong. 14*22dc650dSSadaf Ebrahimi<br> 15*22dc650dSSadaf Ebrahimi<ul> 16*22dc650dSSadaf Ebrahimi<li><a name="TOC1" href="#SEC1">PCRE2 MATCHING ALGORITHMS</a> 17*22dc650dSSadaf Ebrahimi<li><a name="TOC2" href="#SEC2">REGULAR EXPRESSIONS AS TREES</a> 18*22dc650dSSadaf Ebrahimi<li><a name="TOC3" href="#SEC3">THE STANDARD MATCHING ALGORITHM</a> 19*22dc650dSSadaf Ebrahimi<li><a name="TOC4" href="#SEC4">THE ALTERNATIVE MATCHING ALGORITHM</a> 20*22dc650dSSadaf Ebrahimi<li><a name="TOC5" href="#SEC5">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a> 21*22dc650dSSadaf Ebrahimi<li><a name="TOC6" href="#SEC6">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a> 22*22dc650dSSadaf Ebrahimi<li><a name="TOC7" href="#SEC7">AUTHOR</a> 23*22dc650dSSadaf Ebrahimi<li><a name="TOC8" href="#SEC8">REVISION</a> 24*22dc650dSSadaf Ebrahimi</ul> 25*22dc650dSSadaf Ebrahimi<br><a name="SEC1" href="#TOC1">PCRE2 MATCHING ALGORITHMS</a><br> 26*22dc650dSSadaf Ebrahimi<P> 27*22dc650dSSadaf EbrahimiThis document describes the two different algorithms that are available in 28*22dc650dSSadaf EbrahimiPCRE2 for matching a compiled regular expression against a given subject 29*22dc650dSSadaf Ebrahimistring. The "standard" algorithm is the one provided by the <b>pcre2_match()</b> 30*22dc650dSSadaf Ebrahimifunction. This works in the same as Perl's matching function, and provide a 31*22dc650dSSadaf EbrahimiPerl-compatible matching operation. The just-in-time (JIT) optimization that is 32*22dc650dSSadaf Ebrahimidescribed in the 33*22dc650dSSadaf Ebrahimi<a href="pcre2jit.html"><b>pcre2jit</b></a> 34*22dc650dSSadaf Ebrahimidocumentation is compatible with this function. 35*22dc650dSSadaf Ebrahimi</P> 36*22dc650dSSadaf Ebrahimi<P> 37*22dc650dSSadaf EbrahimiAn alternative algorithm is provided by the <b>pcre2_dfa_match()</b> function; 38*22dc650dSSadaf Ebrahimiit operates in a different way, and is not Perl-compatible. This alternative 39*22dc650dSSadaf Ebrahimihas advantages and disadvantages compared with the standard algorithm, and 40*22dc650dSSadaf Ebrahimithese are described below. 41*22dc650dSSadaf Ebrahimi</P> 42*22dc650dSSadaf Ebrahimi<P> 43*22dc650dSSadaf EbrahimiWhen there is only one possible way in which a given subject string can match a 44*22dc650dSSadaf Ebrahimipattern, the two algorithms give the same answer. A difference arises, however, 45*22dc650dSSadaf Ebrahimiwhen there are multiple possibilities. For example, if the pattern 46*22dc650dSSadaf Ebrahimi<pre> 47*22dc650dSSadaf Ebrahimi ^<.*> 48*22dc650dSSadaf Ebrahimi</pre> 49*22dc650dSSadaf Ebrahimiis matched against the string 50*22dc650dSSadaf Ebrahimi<pre> 51*22dc650dSSadaf Ebrahimi <something> <something else> <something further> 52*22dc650dSSadaf Ebrahimi</pre> 53*22dc650dSSadaf Ebrahimithere are three possible answers. The standard algorithm finds only one of 54*22dc650dSSadaf Ebrahimithem, whereas the alternative algorithm finds all three. 55*22dc650dSSadaf Ebrahimi</P> 56*22dc650dSSadaf Ebrahimi<br><a name="SEC2" href="#TOC1">REGULAR EXPRESSIONS AS TREES</a><br> 57*22dc650dSSadaf Ebrahimi<P> 58*22dc650dSSadaf EbrahimiThe set of strings that are matched by a regular expression can be represented 59*22dc650dSSadaf Ebrahimias a tree structure. An unlimited repetition in the pattern makes the tree of 60*22dc650dSSadaf Ebrahimiinfinite size, but it is still a tree. Matching the pattern to a given subject 61*22dc650dSSadaf Ebrahimistring (from a given starting point) can be thought of as a search of the tree. 62*22dc650dSSadaf EbrahimiThere are two ways to search a tree: depth-first and breadth-first, and these 63*22dc650dSSadaf Ebrahimicorrespond to the two matching algorithms provided by PCRE2. 64*22dc650dSSadaf Ebrahimi</P> 65*22dc650dSSadaf Ebrahimi<br><a name="SEC3" href="#TOC1">THE STANDARD MATCHING ALGORITHM</a><br> 66*22dc650dSSadaf Ebrahimi<P> 67*22dc650dSSadaf EbrahimiIn the terminology of Jeffrey Friedl's book "Mastering Regular Expressions", 68*22dc650dSSadaf Ebrahimithe standard algorithm is an "NFA algorithm". It conducts a depth-first search 69*22dc650dSSadaf Ebrahimiof the pattern tree. That is, it proceeds along a single path through the tree, 70*22dc650dSSadaf Ebrahimichecking that the subject matches what is required. When there is a mismatch, 71*22dc650dSSadaf Ebrahimithe algorithm tries any alternatives at the current point, and if they all 72*22dc650dSSadaf Ebrahimifail, it backs up to the previous branch point in the tree, and tries the next 73*22dc650dSSadaf Ebrahimialternative branch at that level. This often involves backing up (moving to the 74*22dc650dSSadaf Ebrahimileft) in the subject string as well. The order in which repetition branches are 75*22dc650dSSadaf Ebrahimitried is controlled by the greedy or ungreedy nature of the quantifier. 76*22dc650dSSadaf Ebrahimi</P> 77*22dc650dSSadaf Ebrahimi<P> 78*22dc650dSSadaf EbrahimiIf a leaf node is reached, a matching string has been found, and at that point 79*22dc650dSSadaf Ebrahimithe algorithm stops. Thus, if there is more than one possible match, this 80*22dc650dSSadaf Ebrahimialgorithm returns the first one that it finds. Whether this is the shortest, 81*22dc650dSSadaf Ebrahimithe longest, or some intermediate length depends on the way the alternations 82*22dc650dSSadaf Ebrahimiand the greedy or ungreedy repetition quantifiers are specified in the 83*22dc650dSSadaf Ebrahimipattern. 84*22dc650dSSadaf Ebrahimi</P> 85*22dc650dSSadaf Ebrahimi<P> 86*22dc650dSSadaf EbrahimiBecause it ends up with a single path through the tree, it is relatively 87*22dc650dSSadaf Ebrahimistraightforward for this algorithm to keep track of the substrings that are 88*22dc650dSSadaf Ebrahimimatched by portions of the pattern in parentheses. This provides support for 89*22dc650dSSadaf Ebrahimicapturing parentheses and backreferences. 90*22dc650dSSadaf Ebrahimi</P> 91*22dc650dSSadaf Ebrahimi<br><a name="SEC4" href="#TOC1">THE ALTERNATIVE MATCHING ALGORITHM</a><br> 92*22dc650dSSadaf Ebrahimi<P> 93*22dc650dSSadaf EbrahimiThis algorithm conducts a breadth-first search of the tree. Starting from the 94*22dc650dSSadaf Ebrahimifirst matching point in the subject, it scans the subject string from left to 95*22dc650dSSadaf Ebrahimiright, once, character by character, and as it does this, it remembers all the 96*22dc650dSSadaf Ebrahimipaths through the tree that represent valid matches. In Friedl's terminology, 97*22dc650dSSadaf Ebrahimithis is a kind of "DFA algorithm", though it is not implemented as a 98*22dc650dSSadaf Ebrahimitraditional finite state machine (it keeps multiple states active 99*22dc650dSSadaf Ebrahimisimultaneously). 100*22dc650dSSadaf Ebrahimi</P> 101*22dc650dSSadaf Ebrahimi<P> 102*22dc650dSSadaf EbrahimiAlthough the general principle of this matching algorithm is that it scans the 103*22dc650dSSadaf Ebrahimisubject string only once, without backtracking, there is one exception: when a 104*22dc650dSSadaf Ebrahimilookaround assertion is encountered, the characters following or preceding the 105*22dc650dSSadaf Ebrahimicurrent point have to be independently inspected. 106*22dc650dSSadaf Ebrahimi</P> 107*22dc650dSSadaf Ebrahimi<P> 108*22dc650dSSadaf EbrahimiThe scan continues until either the end of the subject is reached, or there are 109*22dc650dSSadaf Ebrahimino more unterminated paths. At this point, terminated paths represent the 110*22dc650dSSadaf Ebrahimidifferent matching possibilities (if there are none, the match has failed). 111*22dc650dSSadaf EbrahimiThus, if there is more than one possible match, this algorithm finds all of 112*22dc650dSSadaf Ebrahimithem, and in particular, it finds the longest. The matches are returned in 113*22dc650dSSadaf Ebrahimithe output vector in decreasing order of length. There is an option to stop the 114*22dc650dSSadaf Ebrahimialgorithm after the first match (which is necessarily the shortest) is found. 115*22dc650dSSadaf Ebrahimi</P> 116*22dc650dSSadaf Ebrahimi<P> 117*22dc650dSSadaf EbrahimiNote that the size of vector needed to contain all the results depends on the 118*22dc650dSSadaf Ebrahiminumber of simultaneous matches, not on the number of parentheses in the 119*22dc650dSSadaf Ebrahimipattern. Using <b>pcre2_match_data_create_from_pattern()</b> to create the match 120*22dc650dSSadaf Ebrahimidata block is therefore not advisable when doing DFA matching. 121*22dc650dSSadaf Ebrahimi</P> 122*22dc650dSSadaf Ebrahimi<P> 123*22dc650dSSadaf EbrahimiNote also that all the matches that are found start at the same point in the 124*22dc650dSSadaf Ebrahimisubject. If the pattern 125*22dc650dSSadaf Ebrahimi<pre> 126*22dc650dSSadaf Ebrahimi cat(er(pillar)?)? 127*22dc650dSSadaf Ebrahimi</pre> 128*22dc650dSSadaf Ebrahimiis matched against the string "the caterpillar catchment", the result is the 129*22dc650dSSadaf Ebrahimithree strings "caterpillar", "cater", and "cat" that start at the fifth 130*22dc650dSSadaf Ebrahimicharacter of the subject. The algorithm does not automatically move on to find 131*22dc650dSSadaf Ebrahimimatches that start at later positions. 132*22dc650dSSadaf Ebrahimi</P> 133*22dc650dSSadaf Ebrahimi<P> 134*22dc650dSSadaf EbrahimiPCRE2's "auto-possessification" optimization usually applies to character 135*22dc650dSSadaf Ebrahimirepeats at the end of a pattern (as well as internally). For example, the 136*22dc650dSSadaf Ebrahimipattern "a\d+" is compiled as if it were "a\d++" because there is no point 137*22dc650dSSadaf Ebrahimieven considering the possibility of backtracking into the repeated digits. For 138*22dc650dSSadaf EbrahimiDFA matching, this means that only one possible match is found. If you really 139*22dc650dSSadaf Ebrahimido want multiple matches in such cases, either use an ungreedy repeat 140*22dc650dSSadaf Ebrahimi("a\d+?") or set the PCRE2_NO_AUTO_POSSESS option when compiling. 141*22dc650dSSadaf Ebrahimi</P> 142*22dc650dSSadaf Ebrahimi<P> 143*22dc650dSSadaf EbrahimiThere are a number of features of PCRE2 regular expressions that are not 144*22dc650dSSadaf Ebrahimisupported or behave differently in the alternative matching function. Those 145*22dc650dSSadaf Ebrahimithat are not supported cause an error if encountered. 146*22dc650dSSadaf Ebrahimi</P> 147*22dc650dSSadaf Ebrahimi<P> 148*22dc650dSSadaf Ebrahimi1. Because the algorithm finds all possible matches, the greedy or ungreedy 149*22dc650dSSadaf Ebrahiminature of repetition quantifiers is not relevant (though it may affect 150*22dc650dSSadaf Ebrahimiauto-possessification, as just described). During matching, greedy and ungreedy 151*22dc650dSSadaf Ebrahimiquantifiers are treated in exactly the same way. However, possessive 152*22dc650dSSadaf Ebrahimiquantifiers can make a difference when what follows could also match what is 153*22dc650dSSadaf Ebrahimiquantified, for example in a pattern like this: 154*22dc650dSSadaf Ebrahimi<pre> 155*22dc650dSSadaf Ebrahimi ^a++\w! 156*22dc650dSSadaf Ebrahimi</pre> 157*22dc650dSSadaf EbrahimiThis pattern matches "aaab!" but not "aaa!", which would be matched by a 158*22dc650dSSadaf Ebrahiminon-possessive quantifier. Similarly, if an atomic group is present, it is 159*22dc650dSSadaf Ebrahimimatched as if it were a standalone pattern at the current point, and the 160*22dc650dSSadaf Ebrahimilongest match is then "locked in" for the rest of the overall pattern. 161*22dc650dSSadaf Ebrahimi</P> 162*22dc650dSSadaf Ebrahimi<P> 163*22dc650dSSadaf Ebrahimi2. When dealing with multiple paths through the tree simultaneously, it is not 164*22dc650dSSadaf Ebrahimistraightforward to keep track of captured substrings for the different matching 165*22dc650dSSadaf Ebrahimipossibilities, and PCRE2's implementation of this algorithm does not attempt to 166*22dc650dSSadaf Ebrahimido this. This means that no captured substrings are available. 167*22dc650dSSadaf Ebrahimi</P> 168*22dc650dSSadaf Ebrahimi<P> 169*22dc650dSSadaf Ebrahimi3. Because no substrings are captured, backreferences within the pattern are 170*22dc650dSSadaf Ebrahiminot supported. 171*22dc650dSSadaf Ebrahimi</P> 172*22dc650dSSadaf Ebrahimi<P> 173*22dc650dSSadaf Ebrahimi4. For the same reason, conditional expressions that use a backreference as the 174*22dc650dSSadaf Ebrahimicondition or test for a specific group recursion are not supported. 175*22dc650dSSadaf Ebrahimi</P> 176*22dc650dSSadaf Ebrahimi<P> 177*22dc650dSSadaf Ebrahimi5. Again for the same reason, script runs are not supported. 178*22dc650dSSadaf Ebrahimi</P> 179*22dc650dSSadaf Ebrahimi<P> 180*22dc650dSSadaf Ebrahimi6. Because many paths through the tree may be active, the \K escape sequence, 181*22dc650dSSadaf Ebrahimiwhich resets the start of the match when encountered (but may be on some paths 182*22dc650dSSadaf Ebrahimiand not on others), is not supported. 183*22dc650dSSadaf Ebrahimi</P> 184*22dc650dSSadaf Ebrahimi<P> 185*22dc650dSSadaf Ebrahimi7. Callouts are supported, but the value of the <i>capture_top</i> field is 186*22dc650dSSadaf Ebrahimialways 1, and the value of the <i>capture_last</i> field is always 0. 187*22dc650dSSadaf Ebrahimi</P> 188*22dc650dSSadaf Ebrahimi<P> 189*22dc650dSSadaf Ebrahimi8. The \C escape sequence, which (in the standard algorithm) always matches a 190*22dc650dSSadaf Ebrahimisingle code unit, even in a UTF mode, is not supported in these modes, because 191*22dc650dSSadaf Ebrahimithe alternative algorithm moves through the subject string one character (not 192*22dc650dSSadaf Ebrahimicode unit) at a time, for all active paths through the tree. 193*22dc650dSSadaf Ebrahimi</P> 194*22dc650dSSadaf Ebrahimi<P> 195*22dc650dSSadaf Ebrahimi9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not 196*22dc650dSSadaf Ebrahimisupported. (*FAIL) is supported, and behaves like a failing negative assertion. 197*22dc650dSSadaf Ebrahimi</P> 198*22dc650dSSadaf Ebrahimi<P> 199*22dc650dSSadaf Ebrahimi10. The PCRE2_MATCH_INVALID_UTF option for <b>pcre2_compile()</b> is not 200*22dc650dSSadaf Ebrahimisupported by <b>pcre2_dfa_match()</b>. 201*22dc650dSSadaf Ebrahimi</P> 202*22dc650dSSadaf Ebrahimi<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br> 203*22dc650dSSadaf Ebrahimi<P> 204*22dc650dSSadaf EbrahimiThe main advantage of the alternative algorithm is that all possible matches 205*22dc650dSSadaf Ebrahimi(at a single point in the subject) are automatically found, and in particular, 206*22dc650dSSadaf Ebrahimithe longest match is found. To find more than one match at the same point using 207*22dc650dSSadaf Ebrahimithe standard algorithm, you have to do kludgy things with callouts. 208*22dc650dSSadaf Ebrahimi</P> 209*22dc650dSSadaf Ebrahimi<P> 210*22dc650dSSadaf EbrahimiPartial matching is possible with this algorithm, though it has some 211*22dc650dSSadaf Ebrahimilimitations. The 212*22dc650dSSadaf Ebrahimi<a href="pcre2partial.html"><b>pcre2partial</b></a> 213*22dc650dSSadaf Ebrahimidocumentation gives details of partial matching and discusses multi-segment 214*22dc650dSSadaf Ebrahimimatching. 215*22dc650dSSadaf Ebrahimi</P> 216*22dc650dSSadaf Ebrahimi<br><a name="SEC6" href="#TOC1">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br> 217*22dc650dSSadaf Ebrahimi<P> 218*22dc650dSSadaf EbrahimiThe alternative algorithm suffers from a number of disadvantages: 219*22dc650dSSadaf Ebrahimi</P> 220*22dc650dSSadaf Ebrahimi<P> 221*22dc650dSSadaf Ebrahimi1. It is substantially slower than the standard algorithm. This is partly 222*22dc650dSSadaf Ebrahimibecause it has to search for all possible matches, but is also because it is 223*22dc650dSSadaf Ebrahimiless susceptible to optimization. 224*22dc650dSSadaf Ebrahimi</P> 225*22dc650dSSadaf Ebrahimi<P> 226*22dc650dSSadaf Ebrahimi2. Capturing parentheses, backreferences, script runs, and matching within 227*22dc650dSSadaf Ebrahimiinvalid UTF string are not supported. 228*22dc650dSSadaf Ebrahimi</P> 229*22dc650dSSadaf Ebrahimi<P> 230*22dc650dSSadaf Ebrahimi3. Although atomic groups are supported, their use does not provide the 231*22dc650dSSadaf Ebrahimiperformance advantage that it does for the standard algorithm. 232*22dc650dSSadaf Ebrahimi</P> 233*22dc650dSSadaf Ebrahimi<P> 234*22dc650dSSadaf Ebrahimi4. JIT optimization is not supported. 235*22dc650dSSadaf Ebrahimi</P> 236*22dc650dSSadaf Ebrahimi<br><a name="SEC7" href="#TOC1">AUTHOR</a><br> 237*22dc650dSSadaf Ebrahimi<P> 238*22dc650dSSadaf EbrahimiPhilip Hazel 239*22dc650dSSadaf Ebrahimi<br> 240*22dc650dSSadaf EbrahimiRetired from University Computing Service 241*22dc650dSSadaf Ebrahimi<br> 242*22dc650dSSadaf EbrahimiCambridge, England. 243*22dc650dSSadaf Ebrahimi<br> 244*22dc650dSSadaf Ebrahimi</P> 245*22dc650dSSadaf Ebrahimi<br><a name="SEC8" href="#TOC1">REVISION</a><br> 246*22dc650dSSadaf Ebrahimi<P> 247*22dc650dSSadaf EbrahimiLast updated: 19 January 2024 248*22dc650dSSadaf Ebrahimi<br> 249*22dc650dSSadaf EbrahimiCopyright © 1997-2024 University of Cambridge. 250*22dc650dSSadaf Ebrahimi<br> 251*22dc650dSSadaf Ebrahimi<p> 252*22dc650dSSadaf EbrahimiReturn to the <a href="index.html">PCRE2 index page</a>. 253*22dc650dSSadaf Ebrahimi</p> 254