<html>
<head>
<title>pcre2partial specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2partial man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE2</a>
<li><a name="TOC2" href="#SEC2">REQUIREMENTS FOR A PARTIAL MATCH</a>
<li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre2_match()</a>
<li><a name="TOC4" href="#SEC4">MULTI-SEGMENT MATCHING WITH pcre2_match()</a>
<li><a name="TOC5" href="#SEC5">PARTIAL MATCHING USING pcre2_dfa_match()</a>
<li><a name="TOC6" href="#SEC6">MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()</a>
<li><a name="TOC7" href="#SEC7">AUTHOR</a>
<li><a name="TOC8" href="#SEC8">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE2</a><br>
<P>
In normal use of PCRE2, if there is a match up to the end of a subject string,
but more characters are needed to match the entire pattern, PCRE2_ERROR_NOMATCH
is returned, just like any other failing match. There are circumstances where
it might be helpful to distinguish this "partial match" case.
</P>
<P>
One example is an application where the subject string is very long, and not
all available at once. The requirement here is to be able to do the matching
segment by segment, but special action is needed when a matched substring spans
the boundary between two segments.
</P>
<P>
Another example is checking a user input string as it is typed, to ensure that
it conforms to a required format. Invalid characters can be immediately
diagnosed and rejected, giving instant feedback.
</P>
<P>
Partial matching is a PCRE2-specific feature; it is not Perl-compatible. It is
requested by setting one of the PCRE2_PARTIAL_HARD or PCRE2_PARTIAL_SOFT
options when calling a matching function. The difference between the two
options is whether or not a partial match is preferred to an alternative
complete match, though the details differ between the two types of matching
function. If both options are set, PCRE2_PARTIAL_HARD takes precedence.
</P>
<P>
If you want to use partial matching with just-in-time optimized code, as well
as setting a partial match option for the matching function, you must also call
<b>pcre2_jit_compile()</b> with one or both of these options:
<pre>
  PCRE2_JIT_PARTIAL_HARD
  PCRE2_JIT_PARTIAL_SOFT
</pre>
PCRE2_JIT_COMPLETE should also be set if you are going to run non-partial
matches on the same pattern. Separate code is compiled for each mode. If the
appropriate JIT mode has not been compiled, interpretive matching code is used.
</P>
<P>
Setting a partial matching option disables two of PCRE2's standard
optimization hints. PCRE2 remembers the last literal code unit in a pattern,
and abandons matching immediately if it is not present in the subject string.
This optimization cannot be used for a subject string that might match only
partially. PCRE2 also remembers a minimum length of a matching string, and does
not bother to run the matching function on shorter strings. This optimization
is also disabled for partial matching.
</P>
<br><a name="SEC2" href="#TOC1">REQUIREMENTS FOR A PARTIAL MATCH</a><br>
<P>
A possible partial match occurs during matching when the end of the subject
string is reached successfully, but either more characters are needed to
complete the match, or the addition of more characters might change what is
matched.
</P>
<P>
Example 1: if the pattern is /abc/ and the subject is "ab", more characters are
definitely needed to complete a match. In this case both hard and soft matching
options yield a partial match.
</P>
<P>
Example 2: if the pattern is /ab+/ and the subject is "ab", a complete match
can be found, but the addition of more characters might change what is
matched. In this case, only PCRE2_PARTIAL_HARD returns a partial match;
PCRE2_PARTIAL_SOFT returns the complete match.
</P>
<P>
On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set, if the next
pattern item is \z, \Z, \b, \B, or $ there is always a partial match.
Otherwise, for both options, the next pattern item must be one that inspects a
character, and at least one of the following must be true:
</P>
<P>
(1) At least one character has already been inspected. An inspected character
need not form part of the final matched string; lookbehind assertions and the
\K escape sequence provide ways of inspecting characters before the start of a
matched string.
</P>
<P>
(2) The pattern contains one or more lookbehind assertions. This condition
exists in case there is a lookbehind that inspects characters before the start
of the match.
</P>
<P>
(3) There is a special case when the whole pattern can match an empty string.
When the starting point is at the end of the subject, the empty string match is
a possibility, and if PCRE2_PARTIAL_SOFT is set and neither of the above
conditions is true, it is returned. However, because adding more characters
might result in a non-empty match, PCRE2_PARTIAL_HARD returns a partial match,
which in this case means "there is going to be a match at this point, but until
some more characters are added, we do not know if it will be an empty string or
something longer".
</P>
<br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre2_match()</a><br>
<P>
When a partial matching option is set, the result of calling
<b>pcre2_match()</b> can be one of the following:
</P>
<P>
<b>A successful match</b>
A complete match has been found, starting and ending within this subject.
</P>
<P>
<b>PCRE2_ERROR_NOMATCH</b>
No match can start anywhere in this subject.
</P>
<P>
<b>PCRE2_ERROR_PARTIAL</b>
Adding more characters may result in a complete match that uses one or more
characters from the end of this subject.
</P>
<P>
When a partial match is returned, the first two elements in the ovector point
to the portion of the subject that was matched, but the values in the rest of
the ovector are undefined. The appearance of \K in the pattern has no effect
for a partial match. Consider this pattern:
<pre>
  /abc\K123/
</pre>
If it is matched against "456abc123xyz" the result is a complete match, and the
ovector defines the matched string as "123", because \K resets the "start of
match" point. However, if a partial match is requested and the subject string
is "456abc12", a partial match is found for the string "abc12", because all
these characters are needed for a subsequent re-match with additional
characters.
</P>
<P>
If there is more than one partial match, the first one that was found provides
the data that is returned. Consider this pattern:
<pre>
  /123\w+X|dogY/
</pre>
If this is matched against the subject string "abc123dog", both alternatives
fail to match, but the end of the subject is reached during matching, so
PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 and 9, identifying
"123dog" as the first partial match. (In this example, there are two partial
matches, because "dog" on its own partially matches the second alternative.)
</P>
<br><b>
How a partial match is processed by pcre2_match()
</b><br>
<P>
What happens when a partial match is identified depends on which of the two
partial matching options is set.
</P>
<P>
If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned as soon as a
partial match is found, without continuing to search for possible complete
matches. This option is "hard" because it prefers an earlier partial match over
a later complete match. For this reason, the assumption is made that the end of
the supplied subject string is not the true end of the available data, which is
why \z, \Z, \b, \B, and $ always give a partial match.
</P>
<P>
If PCRE2_PARTIAL_SOFT is set, the partial match is remembered, but matching
continues as normal, and other alternatives in the pattern are tried. If no
complete match can be found, PCRE2_ERROR_PARTIAL is returned instead of
PCRE2_ERROR_NOMATCH. This option is "soft" because it prefers a complete match
over a partial match. All the various matching items in a pattern behave as if
the subject string is potentially complete; \z, \Z, and $ match at the end of
the subject, as normal, and for \b and \B the end of the subject is treated
as a non-alphanumeric.
</P>
<P>
The difference between the two partial matching options can be illustrated by a
pattern such as:
<pre>
  /dog(sbody)?/
</pre>
This matches either "dog" or "dogsbody", greedily (that is, it prefers the
longer string if possible). If it is matched against the string "dog" with
PCRE2_PARTIAL_SOFT, it yields a complete match for "dog". However, if
PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PARTIAL. On the other
hand, if the pattern is made ungreedy the result is different:
<pre>
  /dog(sbody)??/
</pre>
In this case the result is always a complete match because that is found first,
and matching never continues after finding a complete match. It might be easier
to follow this explanation by thinking of the two patterns like this:
<pre>
  /dog(sbody)?/    is the same as  /dogsbody|dog/
  /dog(sbody)??/   is the same as  /dog|dogsbody/
</pre>
The second pattern will never match "dogsbody", because it will always find the
shorter match first.
</P>
<br><b>
Example of partial matching using pcre2test
</b><br>
<P>
The <b>pcre2test</b> data modifiers <b>partial_hard</b> (or <b>ph</b>) and
<b>partial_soft</b> (or <b>ps</b>) set PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT,
respectively, when calling <b>pcre2_match()</b>. Here is a run of
<b>pcre2test</b> using a pattern that matches the whole subject in the form of a
date:
<pre>
    re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
  data&#62; 25dec3\=ph
  Partial match: 25dec3
  data&#62; 3ju\=ph
  Partial match: 3ju
  data&#62; 3juj\=ph
  No match
</pre>
This example gives the same results for both hard and soft partial matching
options. Here is an example where there is a difference:
<pre>
    re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
  data&#62; 25jun04\=ps
   0: 25jun04
   1: jun
  data&#62; 25jun04\=ph
  Partial match: 25jun04
</pre>
With PCRE2_PARTIAL_SOFT, the subject is matched completely. For
PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, so
there is only a partial match.
</P>
<br><a name="SEC4" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_match()</a><br>
<P>
PCRE was not originally designed with multi-segment matching in mind. However,
over time, features (including partial matching) that make multi-segment
matching possible have been added. A very long string can be searched segment
by segment by calling <b>pcre2_match()</b> repeatedly, with the aim of achieving
the same results that would happen if the entire string was available for
searching all the time. Normally, the strings that are being sought are much
shorter than each individual segment, and are in the middle of very long
strings, so the pattern is normally not anchored.
</P>
<P>
Special logic must be implemented to handle a matched substring that spans a
segment boundary. PCRE2_PARTIAL_HARD should be used, because it returns a
partial match at the end of a segment whenever there is the possibility of
changing the match by adding more characters. The PCRE2_NOTBOL option should
also be set for all but the first segment.
</P>
<P>
When a partial match occurs, the next segment must be added to the current
subject and the match re-run, using the <i>startoffset</i> argument of
<b>pcre2_match()</b> to begin at the point where the partial match started.
For example:
<pre>
    re&#62; /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
  data&#62; ...the date is 23ja\=ph
  Partial match: 23ja
  data&#62; ...the date is 23jan19 and on that day...\=offset=15
   0: 23jan19
   1: jan
</pre>
Note the use of the <b>offset</b> modifier to start the new match where the
partial match was found. In this example, the next segment was added to the one
in which the partial match was found. This is the most straightforward
approach, typically using a memory buffer that is twice the size of each
segment. After a partial match, the first half of the buffer is discarded, the
second half is moved to the start of the buffer, and a new segment is added
before repeating the match as in the example above. After a no match, the
entire buffer can be discarded.
</P>
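<P>
The double-size buffer handling can be sketched in plain C, independently of
PCRE2. Everything here is an assumption made for illustration: the type
<b>seg_buffer</b>, the function <b>slide_and_append()</b>, and the deliberately
tiny SEG_SIZE. The function reports how many code units were discarded so that
the caller can subtract that amount from the start offset of the partial match
before re-running the match.
</P>

```c
#include <stddef.h>
#include <string.h>

#define SEG_SIZE 4   /* tiny segment size, for illustration only */

/* Hypothetical buffer that holds up to two segments. */
typedef struct {
    char data[2 * SEG_SIZE];
    size_t len;                /* number of bytes currently in use */
} seg_buffer;

/* Appends the next segment. If there is no room, the first half of the
   buffer is discarded and the second half is moved to the start.
   *shift receives the number of bytes discarded (0 or SEG_SIZE).
   Returns 0 on success, -1 if the segment is larger than SEG_SIZE. */
static int slide_and_append(seg_buffer *b, const char *seg, size_t seg_len,
                            size_t *shift)
{
    if (seg_len > SEG_SIZE) return -1;

    *shift = 0;
    if (b->len + seg_len > 2 * SEG_SIZE) {
        /* Discard the first half; keep whatever follows it. */
        memmove(b->data, b->data + SEG_SIZE, b->len - SEG_SIZE);
        b->len -= SEG_SIZE;
        *shift = SEG_SIZE;
    }

    memcpy(b->data + b->len, seg, seg_len);
    b->len += seg_len;
    return 0;
}
```

<P>
A partial match that started at offset <i>n</i> before the call starts at
offset <i>n</i> minus the reported shift afterwards. This only works if the
partial match began in the retained half of the buffer.
</P>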
<P>
If there are memory constraints, you may want to discard text that precedes a
partial match before adding the next segment. Unfortunately, this is not at
present straightforward. In cases such as the above, where the pattern does not
contain any lookbehinds, it is sufficient to retain only the partially matched
substring. However, if the pattern contains a lookbehind assertion, characters
that precede the start of the partial match may have been inspected during the
matching process. When <b>pcre2test</b> displays a partial match, it indicates
these characters with '&#60;' if the <b>allusedtext</b> modifier is set:
<pre>
    re&#62; "(?&#60;=123)abc"
  data&#62; xx123ab\=ph,allusedtext
  Partial match: 123ab
                 &#60;&#60;&#60;
</pre>
However, the <b>allusedtext</b> modifier is not available for JIT matching,
because JIT matching does not record the first (or last) consulted characters.
For this reason, this information is not available via the API. It is therefore
not possible in general to obtain the exact number of characters that must be
retained in order to get the right match result. If you cannot retain the
entire segment, you must find some heuristic way of choosing.
</P>
<P>
If you know the approximate length of the matching substrings, you can use that
to decide how much text to retain. The only lookbehind information that is
currently available via the API is the length of the longest individual
lookbehind in a pattern, but this can be misleading if there are nested
lookbehinds. The value returned by calling <b>pcre2_pattern_info()</b> with the
PCRE2_INFO_MAXLOOKBEHIND option is the maximum number of characters (not code
units) that any individual lookbehind moves back when it is processed. A
pattern such as "(?&#60;=(?&#60;!b)a)" has a maximum lookbehind value of one, but
inspects two characters before its starting point.
</P>
<P>
In a non-UTF or a 32-bit case, moving back is just a subtraction, but in
UTF-8 or UTF-16 you have to count characters while moving back through the code
units.
</P>
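<P>
For UTF-8, counting characters while moving back amounts to skipping
continuation bytes (those of the form 10xxxxxx). The function below is a
minimal sketch with a hypothetical name, <b>utf8_move_back()</b>; it assumes
that the buffer holds valid UTF-8 and that <i>p</i> points at a character
boundary.
</P>

```c
#include <stddef.h>

/* Hypothetical helper: moves back up to nchars characters from p in a
   valid UTF-8 buffer, never going before start. Returns the new position. */
static const unsigned char *utf8_move_back(const unsigned char *start,
                                           const unsigned char *p,
                                           size_t nchars)
{
    while (nchars > 0 && p > start) {
        p--;
        /* Skip continuation bytes until the lead byte of the character. */
        while (p > start && (*p & 0xC0) == 0x80) p--;
        nchars--;
    }
    return p;
}
```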
<br><a name="SEC5" href="#TOC1">PARTIAL MATCHING USING pcre2_dfa_match()</a><br>
<P>
The DFA function moves along the subject string character by character, without
backtracking, searching for all possible matches simultaneously. If the end of
the subject is reached before the end of the pattern, there is the possibility
of a partial match.
</P>
<P>
When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if there
have been no complete matches. Otherwise, the complete matches are returned.
If PCRE2_PARTIAL_HARD is set, a partial match takes precedence over any
complete matches. The portion of the string that was matched when the longest
partial match was found is set as the first matching string.
</P>
<P>
Because the DFA function always searches for all possible matches, and there is
no difference between greedy and ungreedy repetition, its behaviour is
different from that of <b>pcre2_match()</b>. Consider the string "dog" matched
against this ungreedy pattern:
<pre>
  /dog(sbody)??/
</pre>
Whereas the standard function stops as soon as it finds the complete match for
"dog", the DFA function also finds the partial match for "dogsbody", and so
returns that when PCRE2_PARTIAL_HARD is set.
</P>
<br><a name="SEC6" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()</a><br>
<P>
When a partial match has been found using the DFA matching function, it is
possible to continue the match by providing additional subject data and calling
the function again with the same compiled regular expression, this time setting
the PCRE2_DFA_RESTART option. You must pass the same working space as before,
because this is where details of the previous partial match are stored. You can
set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with PCRE2_DFA_RESTART
to continue partial matching over multiple segments. Here is an example using
<b>pcre2test</b>:
<pre>
    re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
  data&#62; 23ja\=dfa,ps
  Partial match: 23ja
  data&#62; n05\=dfa,dfa_restart
   0: n05
</pre>
The first call has "23ja" as the subject, and requests partial matching; the
second call has "n05" as the subject for the continued (restarted) match.
Notice that when the match is complete, only the last part is shown; PCRE2 does
not retain the previously partially-matched string. It is up to the calling
program to do that if it needs to. This means that, for an unanchored pattern,
if a continued match fails, it is not possible to try again at a new starting
point. All this facility is capable of doing is continuing with the previous
match attempt. For example, consider this pattern:
<pre>
  1234|3789
</pre>
If the first part of the subject is "ABC123", a partial match of the first
alternative is found at offset 3. There is no partial match for the second
alternative, because such a match does not start at the same point in the
subject string. Attempting to continue with the string "7890" does not yield a
match because only those alternatives that match at one point in the subject
are remembered. Depending on the application, this may or may not be what you
want.
</P>
<P>
If you do want to allow for starting again at the next character, one way of
doing it is to retain some or all of the segment and try a new complete match,
as described for <b>pcre2_match()</b> above. Another possibility is to work with
two buffers. If a partial match at offset <i>n</i> in the first buffer is
followed by "no match" when PCRE2_DFA_RESTART is used on the second buffer, you
can then try a new match starting at offset <i>n+1</i> in the first buffer.
</P>
<br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
Retired from University Computing Service
<br>
Cambridge, England.
<br>
</P>
<br><a name="SEC8" href="#TOC1">REVISION</a><br>
<P>
Last updated: 04 September 2019
<br>
Copyright &copy; 1997-2019 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>