xref: /aosp_15_r20/external/pcre/doc/html/pcre2matching.html (revision 22dc650d8ae982c6770746019a6f94af92b0f024)
1*22dc650dSSadaf Ebrahimi<html>
2*22dc650dSSadaf Ebrahimi<head>
3*22dc650dSSadaf Ebrahimi<title>pcre2matching specification</title>
4*22dc650dSSadaf Ebrahimi</head>
5*22dc650dSSadaf Ebrahimi<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6*22dc650dSSadaf Ebrahimi<h1>pcre2matching man page</h1>
7*22dc650dSSadaf Ebrahimi<p>
8*22dc650dSSadaf EbrahimiReturn to the <a href="index.html">PCRE2 index page</a>.
9*22dc650dSSadaf Ebrahimi</p>
10*22dc650dSSadaf Ebrahimi<p>
11*22dc650dSSadaf EbrahimiThis page is part of the PCRE2 HTML documentation. It was generated
12*22dc650dSSadaf Ebrahimiautomatically from the original man page. If there is any nonsense in it,
13*22dc650dSSadaf Ebrahimiplease consult the man page, in case the conversion went wrong.
14*22dc650dSSadaf Ebrahimi<br>
15*22dc650dSSadaf Ebrahimi<ul>
16*22dc650dSSadaf Ebrahimi<li><a name="TOC1" href="#SEC1">PCRE2 MATCHING ALGORITHMS</a>
17*22dc650dSSadaf Ebrahimi<li><a name="TOC2" href="#SEC2">REGULAR EXPRESSIONS AS TREES</a>
18*22dc650dSSadaf Ebrahimi<li><a name="TOC3" href="#SEC3">THE STANDARD MATCHING ALGORITHM</a>
19*22dc650dSSadaf Ebrahimi<li><a name="TOC4" href="#SEC4">THE ALTERNATIVE MATCHING ALGORITHM</a>
20*22dc650dSSadaf Ebrahimi<li><a name="TOC5" href="#SEC5">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a>
21*22dc650dSSadaf Ebrahimi<li><a name="TOC6" href="#SEC6">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a>
22*22dc650dSSadaf Ebrahimi<li><a name="TOC7" href="#SEC7">AUTHOR</a>
23*22dc650dSSadaf Ebrahimi<li><a name="TOC8" href="#SEC8">REVISION</a>
24*22dc650dSSadaf Ebrahimi</ul>
25*22dc650dSSadaf Ebrahimi<br><a name="SEC1" href="#TOC1">PCRE2 MATCHING ALGORITHMS</a><br>
26*22dc650dSSadaf Ebrahimi<P>
27*22dc650dSSadaf EbrahimiThis document describes the two different algorithms that are available in
28*22dc650dSSadaf EbrahimiPCRE2 for matching a compiled regular expression against a given subject
29*22dc650dSSadaf Ebrahimistring. The "standard" algorithm is the one provided by the <b>pcre2_match()</b>
30*22dc650dSSadaf Ebrahimifunction. This works in the same as Perl's matching function, and provide a
31*22dc650dSSadaf EbrahimiPerl-compatible matching operation. The just-in-time (JIT) optimization that is
32*22dc650dSSadaf Ebrahimidescribed in the
33*22dc650dSSadaf Ebrahimi<a href="pcre2jit.html"><b>pcre2jit</b></a>
34*22dc650dSSadaf Ebrahimidocumentation is compatible with this function.
35*22dc650dSSadaf Ebrahimi</P>
36*22dc650dSSadaf Ebrahimi<P>
37*22dc650dSSadaf EbrahimiAn alternative algorithm is provided by the <b>pcre2_dfa_match()</b> function;
38*22dc650dSSadaf Ebrahimiit operates in a different way, and is not Perl-compatible. This alternative
39*22dc650dSSadaf Ebrahimihas advantages and disadvantages compared with the standard algorithm, and
40*22dc650dSSadaf Ebrahimithese are described below.
41*22dc650dSSadaf Ebrahimi</P>
42*22dc650dSSadaf Ebrahimi<P>
43*22dc650dSSadaf EbrahimiWhen there is only one possible way in which a given subject string can match a
44*22dc650dSSadaf Ebrahimipattern, the two algorithms give the same answer. A difference arises, however,
45*22dc650dSSadaf Ebrahimiwhen there are multiple possibilities. For example, if the pattern
46*22dc650dSSadaf Ebrahimi<pre>
47*22dc650dSSadaf Ebrahimi  ^&#60;.*&#62;
48*22dc650dSSadaf Ebrahimi</pre>
49*22dc650dSSadaf Ebrahimiis matched against the string
50*22dc650dSSadaf Ebrahimi<pre>
51*22dc650dSSadaf Ebrahimi  &#60;something&#62; &#60;something else&#62; &#60;something further&#62;
52*22dc650dSSadaf Ebrahimi</pre>
53*22dc650dSSadaf Ebrahimithere are three possible answers. The standard algorithm finds only one of
54*22dc650dSSadaf Ebrahimithem, whereas the alternative algorithm finds all three.
55*22dc650dSSadaf Ebrahimi</P>
56*22dc650dSSadaf Ebrahimi<br><a name="SEC2" href="#TOC1">REGULAR EXPRESSIONS AS TREES</a><br>
57*22dc650dSSadaf Ebrahimi<P>
58*22dc650dSSadaf EbrahimiThe set of strings that are matched by a regular expression can be represented
59*22dc650dSSadaf Ebrahimias a tree structure. An unlimited repetition in the pattern makes the tree of
60*22dc650dSSadaf Ebrahimiinfinite size, but it is still a tree. Matching the pattern to a given subject
61*22dc650dSSadaf Ebrahimistring (from a given starting point) can be thought of as a search of the tree.
62*22dc650dSSadaf EbrahimiThere are two ways to search a tree: depth-first and breadth-first, and these
63*22dc650dSSadaf Ebrahimicorrespond to the two matching algorithms provided by PCRE2.
64*22dc650dSSadaf Ebrahimi</P>
65*22dc650dSSadaf Ebrahimi<br><a name="SEC3" href="#TOC1">THE STANDARD MATCHING ALGORITHM</a><br>
66*22dc650dSSadaf Ebrahimi<P>
67*22dc650dSSadaf EbrahimiIn the terminology of Jeffrey Friedl's book "Mastering Regular Expressions",
68*22dc650dSSadaf Ebrahimithe standard algorithm is an "NFA algorithm". It conducts a depth-first search
69*22dc650dSSadaf Ebrahimiof the pattern tree. That is, it proceeds along a single path through the tree,
70*22dc650dSSadaf Ebrahimichecking that the subject matches what is required. When there is a mismatch,
71*22dc650dSSadaf Ebrahimithe algorithm tries any alternatives at the current point, and if they all
72*22dc650dSSadaf Ebrahimifail, it backs up to the previous branch point in the tree, and tries the next
73*22dc650dSSadaf Ebrahimialternative branch at that level. This often involves backing up (moving to the
74*22dc650dSSadaf Ebrahimileft) in the subject string as well. The order in which repetition branches are
75*22dc650dSSadaf Ebrahimitried is controlled by the greedy or ungreedy nature of the quantifier.
76*22dc650dSSadaf Ebrahimi</P>
77*22dc650dSSadaf Ebrahimi<P>
78*22dc650dSSadaf EbrahimiIf a leaf node is reached, a matching string has been found, and at that point
79*22dc650dSSadaf Ebrahimithe algorithm stops. Thus, if there is more than one possible match, this
80*22dc650dSSadaf Ebrahimialgorithm returns the first one that it finds. Whether this is the shortest,
81*22dc650dSSadaf Ebrahimithe longest, or some intermediate length depends on the way the alternations
82*22dc650dSSadaf Ebrahimiand the greedy or ungreedy repetition quantifiers are specified in the
83*22dc650dSSadaf Ebrahimipattern.
84*22dc650dSSadaf Ebrahimi</P>
85*22dc650dSSadaf Ebrahimi<P>
86*22dc650dSSadaf EbrahimiBecause it ends up with a single path through the tree, it is relatively
87*22dc650dSSadaf Ebrahimistraightforward for this algorithm to keep track of the substrings that are
88*22dc650dSSadaf Ebrahimimatched by portions of the pattern in parentheses. This provides support for
89*22dc650dSSadaf Ebrahimicapturing parentheses and backreferences.
90*22dc650dSSadaf Ebrahimi</P>
91*22dc650dSSadaf Ebrahimi<br><a name="SEC4" href="#TOC1">THE ALTERNATIVE MATCHING ALGORITHM</a><br>
92*22dc650dSSadaf Ebrahimi<P>
93*22dc650dSSadaf EbrahimiThis algorithm conducts a breadth-first search of the tree. Starting from the
94*22dc650dSSadaf Ebrahimifirst matching point in the subject, it scans the subject string from left to
95*22dc650dSSadaf Ebrahimiright, once, character by character, and as it does this, it remembers all the
96*22dc650dSSadaf Ebrahimipaths through the tree that represent valid matches. In Friedl's terminology,
97*22dc650dSSadaf Ebrahimithis is a kind of "DFA algorithm", though it is not implemented as a
98*22dc650dSSadaf Ebrahimitraditional finite state machine (it keeps multiple states active
99*22dc650dSSadaf Ebrahimisimultaneously).
100*22dc650dSSadaf Ebrahimi</P>
101*22dc650dSSadaf Ebrahimi<P>
102*22dc650dSSadaf EbrahimiAlthough the general principle of this matching algorithm is that it scans the
103*22dc650dSSadaf Ebrahimisubject string only once, without backtracking, there is one exception: when a
104*22dc650dSSadaf Ebrahimilookaround assertion is encountered, the characters following or preceding the
105*22dc650dSSadaf Ebrahimicurrent point have to be independently inspected.
106*22dc650dSSadaf Ebrahimi</P>
107*22dc650dSSadaf Ebrahimi<P>
108*22dc650dSSadaf EbrahimiThe scan continues until either the end of the subject is reached, or there are
109*22dc650dSSadaf Ebrahimino more unterminated paths. At this point, terminated paths represent the
110*22dc650dSSadaf Ebrahimidifferent matching possibilities (if there are none, the match has failed).
111*22dc650dSSadaf EbrahimiThus, if there is more than one possible match, this algorithm finds all of
112*22dc650dSSadaf Ebrahimithem, and in particular, it finds the longest. The matches are returned in
113*22dc650dSSadaf Ebrahimithe output vector in decreasing order of length. There is an option to stop the
114*22dc650dSSadaf Ebrahimialgorithm after the first match (which is necessarily the shortest) is found.
115*22dc650dSSadaf Ebrahimi</P>
116*22dc650dSSadaf Ebrahimi<P>
117*22dc650dSSadaf EbrahimiNote that the size of vector needed to contain all the results depends on the
118*22dc650dSSadaf Ebrahiminumber of simultaneous matches, not on the number of parentheses in the
119*22dc650dSSadaf Ebrahimipattern. Using <b>pcre2_match_data_create_from_pattern()</b> to create the match
120*22dc650dSSadaf Ebrahimidata block is therefore not advisable when doing DFA matching.
121*22dc650dSSadaf Ebrahimi</P>
122*22dc650dSSadaf Ebrahimi<P>
123*22dc650dSSadaf EbrahimiNote also that all the matches that are found start at the same point in the
124*22dc650dSSadaf Ebrahimisubject. If the pattern
125*22dc650dSSadaf Ebrahimi<pre>
126*22dc650dSSadaf Ebrahimi  cat(er(pillar)?)?
127*22dc650dSSadaf Ebrahimi</pre>
128*22dc650dSSadaf Ebrahimiis matched against the string "the caterpillar catchment", the result is the
129*22dc650dSSadaf Ebrahimithree strings "caterpillar", "cater", and "cat" that start at the fifth
130*22dc650dSSadaf Ebrahimicharacter of the subject. The algorithm does not automatically move on to find
131*22dc650dSSadaf Ebrahimimatches that start at later positions.
132*22dc650dSSadaf Ebrahimi</P>
133*22dc650dSSadaf Ebrahimi<P>
134*22dc650dSSadaf EbrahimiPCRE2's "auto-possessification" optimization usually applies to character
135*22dc650dSSadaf Ebrahimirepeats at the end of a pattern (as well as internally). For example, the
136*22dc650dSSadaf Ebrahimipattern "a\d+" is compiled as if it were "a\d++" because there is no point
137*22dc650dSSadaf Ebrahimieven considering the possibility of backtracking into the repeated digits. For
138*22dc650dSSadaf EbrahimiDFA matching, this means that only one possible match is found. If you really
139*22dc650dSSadaf Ebrahimido want multiple matches in such cases, either use an ungreedy repeat
140*22dc650dSSadaf Ebrahimi("a\d+?") or set the PCRE2_NO_AUTO_POSSESS option when compiling.
141*22dc650dSSadaf Ebrahimi</P>
142*22dc650dSSadaf Ebrahimi<P>
143*22dc650dSSadaf EbrahimiThere are a number of features of PCRE2 regular expressions that are not
144*22dc650dSSadaf Ebrahimisupported or behave differently in the alternative matching function. Those
145*22dc650dSSadaf Ebrahimithat are not supported cause an error if encountered.
146*22dc650dSSadaf Ebrahimi</P>
147*22dc650dSSadaf Ebrahimi<P>
148*22dc650dSSadaf Ebrahimi1. Because the algorithm finds all possible matches, the greedy or ungreedy
149*22dc650dSSadaf Ebrahiminature of repetition quantifiers is not relevant (though it may affect
150*22dc650dSSadaf Ebrahimiauto-possessification, as just described). During matching, greedy and ungreedy
151*22dc650dSSadaf Ebrahimiquantifiers are treated in exactly the same way. However, possessive
152*22dc650dSSadaf Ebrahimiquantifiers can make a difference when what follows could also match what is
153*22dc650dSSadaf Ebrahimiquantified, for example in a pattern like this:
154*22dc650dSSadaf Ebrahimi<pre>
155*22dc650dSSadaf Ebrahimi  ^a++\w!
156*22dc650dSSadaf Ebrahimi</pre>
157*22dc650dSSadaf EbrahimiThis pattern matches "aaab!" but not "aaa!", which would be matched by a
158*22dc650dSSadaf Ebrahiminon-possessive quantifier. Similarly, if an atomic group is present, it is
159*22dc650dSSadaf Ebrahimimatched as if it were a standalone pattern at the current point, and the
160*22dc650dSSadaf Ebrahimilongest match is then "locked in" for the rest of the overall pattern.
161*22dc650dSSadaf Ebrahimi</P>
162*22dc650dSSadaf Ebrahimi<P>
163*22dc650dSSadaf Ebrahimi2. When dealing with multiple paths through the tree simultaneously, it is not
164*22dc650dSSadaf Ebrahimistraightforward to keep track of captured substrings for the different matching
165*22dc650dSSadaf Ebrahimipossibilities, and PCRE2's implementation of this algorithm does not attempt to
166*22dc650dSSadaf Ebrahimido this. This means that no captured substrings are available.
167*22dc650dSSadaf Ebrahimi</P>
168*22dc650dSSadaf Ebrahimi<P>
169*22dc650dSSadaf Ebrahimi3. Because no substrings are captured, backreferences within the pattern are
170*22dc650dSSadaf Ebrahiminot supported.
171*22dc650dSSadaf Ebrahimi</P>
172*22dc650dSSadaf Ebrahimi<P>
173*22dc650dSSadaf Ebrahimi4. For the same reason, conditional expressions that use a backreference as the
174*22dc650dSSadaf Ebrahimicondition or test for a specific group recursion are not supported.
175*22dc650dSSadaf Ebrahimi</P>
176*22dc650dSSadaf Ebrahimi<P>
177*22dc650dSSadaf Ebrahimi5. Again for the same reason, script runs are not supported.
178*22dc650dSSadaf Ebrahimi</P>
179*22dc650dSSadaf Ebrahimi<P>
180*22dc650dSSadaf Ebrahimi6. Because many paths through the tree may be active, the \K escape sequence,
181*22dc650dSSadaf Ebrahimiwhich resets the start of the match when encountered (but may be on some paths
182*22dc650dSSadaf Ebrahimiand not on others), is not supported.
183*22dc650dSSadaf Ebrahimi</P>
184*22dc650dSSadaf Ebrahimi<P>
185*22dc650dSSadaf Ebrahimi7. Callouts are supported, but the value of the <i>capture_top</i> field is
186*22dc650dSSadaf Ebrahimialways 1, and the value of the <i>capture_last</i> field is always 0.
187*22dc650dSSadaf Ebrahimi</P>
188*22dc650dSSadaf Ebrahimi<P>
189*22dc650dSSadaf Ebrahimi8. The \C escape sequence, which (in the standard algorithm) always matches a
190*22dc650dSSadaf Ebrahimisingle code unit, even in a UTF mode, is not supported in these modes, because
191*22dc650dSSadaf Ebrahimithe alternative algorithm moves through the subject string one character (not
192*22dc650dSSadaf Ebrahimicode unit) at a time, for all active paths through the tree.
193*22dc650dSSadaf Ebrahimi</P>
194*22dc650dSSadaf Ebrahimi<P>
195*22dc650dSSadaf Ebrahimi9. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not
196*22dc650dSSadaf Ebrahimisupported. (*FAIL) is supported, and behaves like a failing negative assertion.
197*22dc650dSSadaf Ebrahimi</P>
198*22dc650dSSadaf Ebrahimi<P>
199*22dc650dSSadaf Ebrahimi10. The PCRE2_MATCH_INVALID_UTF option for <b>pcre2_compile()</b> is not
200*22dc650dSSadaf Ebrahimisupported by <b>pcre2_dfa_match()</b>.
201*22dc650dSSadaf Ebrahimi</P>
202*22dc650dSSadaf Ebrahimi<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
203*22dc650dSSadaf Ebrahimi<P>
204*22dc650dSSadaf EbrahimiThe main advantage of the alternative algorithm is that all possible matches
205*22dc650dSSadaf Ebrahimi(at a single point in the subject) are automatically found, and in particular,
206*22dc650dSSadaf Ebrahimithe longest match is found. To find more than one match at the same point using
207*22dc650dSSadaf Ebrahimithe standard algorithm, you have to do kludgy things with callouts.
208*22dc650dSSadaf Ebrahimi</P>
209*22dc650dSSadaf Ebrahimi<P>
210*22dc650dSSadaf EbrahimiPartial matching is possible with this algorithm, though it has some
211*22dc650dSSadaf Ebrahimilimitations. The
212*22dc650dSSadaf Ebrahimi<a href="pcre2partial.html"><b>pcre2partial</b></a>
213*22dc650dSSadaf Ebrahimidocumentation gives details of partial matching and discusses multi-segment
214*22dc650dSSadaf Ebrahimimatching.
215*22dc650dSSadaf Ebrahimi</P>
216*22dc650dSSadaf Ebrahimi<br><a name="SEC6" href="#TOC1">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
217*22dc650dSSadaf Ebrahimi<P>
218*22dc650dSSadaf EbrahimiThe alternative algorithm suffers from a number of disadvantages:
219*22dc650dSSadaf Ebrahimi</P>
220*22dc650dSSadaf Ebrahimi<P>
221*22dc650dSSadaf Ebrahimi1. It is substantially slower than the standard algorithm. This is partly
222*22dc650dSSadaf Ebrahimibecause it has to search for all possible matches, but is also because it is
223*22dc650dSSadaf Ebrahimiless susceptible to optimization.
224*22dc650dSSadaf Ebrahimi</P>
225*22dc650dSSadaf Ebrahimi<P>
226*22dc650dSSadaf Ebrahimi2. Capturing parentheses, backreferences, script runs, and matching within
227*22dc650dSSadaf Ebrahimiinvalid UTF string are not supported.
228*22dc650dSSadaf Ebrahimi</P>
229*22dc650dSSadaf Ebrahimi<P>
230*22dc650dSSadaf Ebrahimi3. Although atomic groups are supported, their use does not provide the
231*22dc650dSSadaf Ebrahimiperformance advantage that it does for the standard algorithm.
232*22dc650dSSadaf Ebrahimi</P>
233*22dc650dSSadaf Ebrahimi<P>
234*22dc650dSSadaf Ebrahimi4. JIT optimization is not supported.
235*22dc650dSSadaf Ebrahimi</P>
236*22dc650dSSadaf Ebrahimi<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
237*22dc650dSSadaf Ebrahimi<P>
238*22dc650dSSadaf EbrahimiPhilip Hazel
239*22dc650dSSadaf Ebrahimi<br>
240*22dc650dSSadaf EbrahimiRetired from University Computing Service
241*22dc650dSSadaf Ebrahimi<br>
242*22dc650dSSadaf EbrahimiCambridge, England.
243*22dc650dSSadaf Ebrahimi<br>
244*22dc650dSSadaf Ebrahimi</P>
245*22dc650dSSadaf Ebrahimi<br><a name="SEC8" href="#TOC1">REVISION</a><br>
246*22dc650dSSadaf Ebrahimi<P>
247*22dc650dSSadaf EbrahimiLast updated: 19 January 2024
248*22dc650dSSadaf Ebrahimi<br>
249*22dc650dSSadaf EbrahimiCopyright &copy; 1997-2024 University of Cambridge.
250*22dc650dSSadaf Ebrahimi<br>
251*22dc650dSSadaf Ebrahimi<p>
252*22dc650dSSadaf EbrahimiReturn to the <a href="index.html">PCRE2 index page</a>.
253*22dc650dSSadaf Ebrahimi</p>
254