MAINTENANCE README FOR PCRE2
============================

The files in the "maint" directory of the PCRE2 source contain data, scripts,
and programs that are used for the maintenance of PCRE2, but which do not form
part of the PCRE2 distribution tarballs. This document describes these files
and also contains some notes for maintainers. Its contents are:

  Files in the maint directory
  Updating to a new Unicode release
  Preparing for a PCRE2 release
  Updating version info for libtool
  Making a PCRE2 release
  Future ideas (wish list)


Files in the maint directory
============================

GenerateCommon.py
  A Python module containing data and functions that are used by the other
  Generate scripts.

GenerateTest26.py
  A Python script that generates input and expected output test data for test
  26, which tests certain aspects of Unicode property support.

GenerateUcd.py
  A Python script that generates the file pcre2_ucd.c from GenerateCommon.py
  and Unicode data files, which are themselves downloaded from the Unicode web
  site. The generated file contains the tables for a 2-stage lookup of Unicode
  properties, along with some auxiliary tables. The script starts with a long
  comment that gives details of the tables it constructs.

GenerateUcpHeader.py
  A Python script that generates the file pcre2_ucp.h from GenerateCommon.py
  and Unicode data files. The generated file defines constants for various
  Unicode property values.

GenerateUcpTables.py
  A Python script that generates the file pcre2_ucptables.c from
  GenerateCommon.py and Unicode data files. The generated file contains tables
  for looking up Unicode property names.

ManyConfigTests
  A shell script that runs "configure, make, test" a number of times with
  different configuration settings.

pcre2_chartables.c.non-standard
  This is a set of character tables that came from a Windows system. It has
  characters greater than 128 that are set as spaces, amongst other things. I
  kept it so that it can be used for testing from time to time.

README
  This file.

Unicode.tables
  The files in this directory were downloaded from the Unicode web site. They
  contain information about Unicode characters and scripts, and are used by the
  Generate scripts. There is also UnicodeData.txt, which is no longer used by
  any script, but is kept because it is occasionally useful for manually
  looking up the details of certain characters. However, note that character
  names in this file such as "Arabic sign sanah" do NOT mean that the character
  is in a particular script (in this case, Arabic). Scripts.txt and
  ScriptExtensions.txt are where to look for script information.

ucptest.c
  A program for testing the Unicode property macros that do lookups in the
  pcre2_ucd.c data, mainly useful after rebuilding the Unicode property tables.
  Compile and run this in the "maint" directory (see comments at its head).
  This program can also be used to find characters with specific properties and
  to list which properties are supported.

ucptestdata
  A directory containing four files, testinput{1,2} and testoutput{1,2}, for
  use in conjunction with the ucptest program.

utf8.c
  A short, freestanding C program for converting a Unicode code point into a
  sequence of bytes in the UTF-8 encoding, and vice versa. If its argument is a
  hex number such as 0x1234, it outputs a list of the equivalent UTF-8 bytes.
  If its argument is a sequence of concatenated UTF-8 bytes (e.g. 12e188b4) it
  treats them as a UTF-8 string and outputs the equivalent code points in hex.
  See comments at its head for details.
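
As a rough illustration of the two conversions described for utf8.c, the
following Python sketch does the same job with the standard library. It is not
the C program itself: the function names are invented for this example, and
the output format of the real tool may well differ (see the comments at its
head).

```python
def codepoint_to_utf8_hex(arg):
    """Turn a hex code point such as '0x1234' into its UTF-8 bytes, in hex."""
    code_point = int(arg, 16)          # base 16 accepts an optional '0x' prefix
    return chr(code_point).encode("utf-8").hex()


def utf8_hex_to_codepoints(arg):
    """Turn concatenated UTF-8 bytes, given in hex, into hex code points."""
    text = bytes.fromhex(arg).decode("utf-8")   # raises on invalid UTF-8
    return ["0x%04x" % ord(ch) for ch in text]
```

With the examples used above, codepoint_to_utf8_hex("0x1234") gives "e188b4",
and utf8_hex_to_codepoints("12e188b4") gives the two code points 0x0012 and
0x1234.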


Updating to a new Unicode release
=================================

When there is a new release of Unicode, the files in Unicode.tables must be
refreshed from the web site. Once that is done, the four Python scripts that
generate files from the Unicode data can be run from within the "maint"
directory.

Note: Previously, it was necessary to update lists of scripts and their
abbreviations by hand before running the Python scripts. This is no longer
necessary because the scripts have been upgraded to extract this information
themselves. Also, there used to be explicit lists of scripts in two of the man
pages. This is no longer the case; the pcre2test program can now output a list
of supported scripts.

You can give an output file name as an argument to the following scripts, but
by default:

GenerateUcd.py        creates pcre2_ucd.c        )
GenerateUcpHeader.py  creates pcre2_ucp.h        ) in the current directory
GenerateUcpTables.py  creates pcre2_ucptables.c  )

These files can be compared against the existing versions in the src directory
to check on any changes before replacing the old files, but you can also
generate directly into the final location by running:

./GenerateUcd.py       ../src/pcre2_ucd.c
./GenerateUcpHeader.py ../src/pcre2_ucp.h
./GenerateUcpTables.py ../src/pcre2_ucptables.c
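
For the comparison route (generating into the current directory first), a small
Python sketch such as the following can show what changed relative to the
copies in ../src. The helper name diff_against_src is invented for this
example and is not part of the maint scripts. The default directories assume
it is run from within "maint", and the files are assumed to be readable as
UTF-8.

```python
import difflib
from pathlib import Path

# The three files that the Generate scripts create by default.
GENERATED = ["pcre2_ucd.c", "pcre2_ucp.h", "pcre2_ucptables.c"]


def diff_against_src(name, new_dir=".", src_dir="../src"):
    """Return unified-diff lines between src_dir/name and new_dir/name."""
    old = Path(src_dir, name).read_text(encoding="utf-8").splitlines(keepends=True)
    new = Path(new_dir, name).read_text(encoding="utf-8").splitlines(keepends=True)
    return list(difflib.unified_diff(old, new,
                                     fromfile=f"{src_dir}/{name}",
                                     tofile=f"{new_dir}/{name}"))
```

Calling diff_against_src(name) for each entry in GENERATED returns an empty
list for any file that is unchanged, and the diff lines otherwise.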

Once the .c and .h files are in the ../src directory, the ucptest program can
be compiled and used to check that the new tables work properly. The data files
in ucptestdata are set up to check a number of test characters. See the
comments at the start of ucptest.c. If there are new scripts, adding a few
tests to the files in ucptestdata is a good idea.

Finally, you should run the GenerateTest26.py script to generate new versions
of the input and expected output from a series of Unicode property tests that
are automatically generated from the Unicode data files. By default, the files
are written to testinput26 and testoutput26 in the current directory, but you
can give an alternative directory name as an argument to the script. These
files should eventually be installed in the main testdata directory.


Preparing for a PCRE2 release
=============================

This section contains a checklist of things that I do before building a new
release.

. Ensure that the version number and version date are correct in configure.ac.

. Update the library version numbers in configure.ac according to the rules
  given below.

. If new build options or new source files have been added, ensure that they
  are added to the CMake files as well as to the autoconf files. The relevant
  files are CMakeLists.txt and config-cmake.h.in. After making a release, test
  it out with CMake if there have been changes here.

. Run ./autogen.sh to ensure everything is up-to-date.

. Compile and test with many different config options, and combinations of
  options. Also, test with valgrind by running "RunTest valgrind" and
  "RunGrepTest valgrind". The script maint/ManyConfigTests now encapsulates
  this testing. It runs tests with different configurations, and it also runs
  some of them with valgrind, all of which can take quite some time.

. Run tests in both 32-bit and 64-bit environments if possible. I can no longer
  run 32-bit tests.

. Run tests with two or more different compilers (e.g. clang and gcc), and
  make use of -fsanitize=address and friends where possible. For gcc,
  -fsanitize=undefined -std=gnu99 picks up undefined behaviour at runtime.
  For clang, -fsanitize=address,undefined,integer can be used but
  -fno-sanitize=unsigned-integer-overflow must be added when compiling with JIT.
  Another useful clang option is -fsanitize=signed-integer-overflow.

. Do a test build using CMake. Remove src/config.h first, lest it override the
  version that CMake creates. Also do a CMake unity build to check that it
  still works: [c]cmake -DCMAKE_UNITY_BUILD=ON sets up a unity build.

. Run perltest.sh on the test data for tests 1 and 4. The output should match
  the PCRE2 test output, apart from the version identification at the start of
  each test. Sometimes there are other differences in test 4 if PCRE2 and Perl
  are using different Unicode releases. The other tests are not Perl-compatible
  (they use various PCRE2-specific features or options).

. It is possible to test with the emulated memmove() function by undefining
  HAVE_MEMMOVE and HAVE_BCOPY in config.h, though I do not do this often.

. Documentation: check AUTHORS, ChangeLog (check version and date), LICENCE,
  NEWS (check version and date), NON-AUTOTOOLS-BUILD, and README. Many of these
  won't need changing, but over the long term things do change.

. I used to test new releases myself on a number of different operating
  systems. For example, on Solaris it is helpful to test using Sun's cc
  compiler as a change from gcc. Adding -xarch=v9 to the cc options does a
  64-bit test, but it also needs -S 64 for pcre2test to increase the stack size
  for test 2. Since I retired I can no longer do much of this. There are
  automated tests under Ubuntu, Alpine, and Windows that are now set up as
  GitHub actions. Check that they are running clean.

. The buildbots at http://buildfarm.opencsw.org/ do some automated testing
  of PCRE2 and should also be checked before putting out a release.


Updating version info for libtool
=================================

This set of rules for updating library version information came from a web page
whose URL I have forgotten. The version information consists of three parts:
(current, revision, age).

1. Start with version information of 0:0:0 for each libtool library.

2. Update the version information only immediately before a public release of
   your software. More frequent updates are unnecessary, and only guarantee
   that the current interface number gets larger faster.

3. If the library source code has changed at all since the last update, then
   increment revision; c:r:a becomes c:r+1:a.

4. If any interfaces have been added, removed, or changed since the last
   update, increment current, and set revision to 0.

5. If any interfaces have been added since the last public release, then
   increment age.

6. If any interfaces have been removed or changed since the last public
   release, then set age to 0.

The following explanation may help in understanding the above rules a bit
better. Consider that there are three possible kinds of reaction from users to
changes in a shared library:

1. Programs using the previous version may use the new version as a drop-in
   replacement, and programs using the new version can also work with the
   previous one. In other words, no recompiling or relinking is needed. In
   this case, increment revision only, don't touch current or age.

2. Programs using the previous version may use the new version as a drop-in
   replacement, but programs using the new version may use APIs not present in
   the previous one. In other words, a program linking against the new version
   may fail if linked against the old version at run time. In this case, set
   revision to 0, increment current and age.

3. Programs may need to be changed, recompiled, relinked in order to use the
   new version. Increment current, set revision and age to 0.
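
Rules 3 to 6 can be written down as a short Python function, which may make it
easier to check a proposed change to the numbers in configure.ac. This is only
a sketch of the rules as stated above: the function and parameter names are
invented for this example, and nothing like it exists in the PCRE2 build
system.

```python
def next_libtool_version(current, revision, age,
                         source_changed=False,
                         interfaces_added=False,
                         interfaces_removed_or_changed=False):
    """Apply rules 3 to 6 to a (current, revision, age) triple."""
    if source_changed:                                    # rule 3
        revision += 1
    if interfaces_added or interfaces_removed_or_changed:  # rule 4
        current += 1
        revision = 0
    if interfaces_added:                                  # rule 5
        age += 1
    if interfaces_removed_or_changed:                     # rule 6
        age = 0
    return current, revision, age
```

Starting from 3:2:1, the three cases in the explanation above come out as
3:3:1 (source change only), 4:0:2 (interfaces added), and 4:0:0 (interfaces
removed or changed).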


Making a PCRE2 release
======================

Run PrepareRelease and commit the files that it changes. The first thing this
script does is to run CheckMan on the man pages; if it finds any markup errors,
it reports them and then aborts. Otherwise it removes trailing spaces from
sources and refreshes the HTML documentation. Update the GitHub repository with
"git push".

Once PrepareRelease has run clean, run "make distcheck" to create the tarballs
and the zipball. I then sign these files. Double-check with "git status" that
the repository is fully up-to-date, then create a new tag and a release on
GitHub. Upload the tarballs, zipball, and the signatures as "assets" of the
GitHub release.

When the new release is out, don't forget to tell [email protected] and the
mailing list.


Future ideas (wish list)
========================

This section records a list of ideas so that they do not get forgotten. They
vary enormously in their usefulness and potential for implementation. Some are
very sensible; some are rather wacky. Some have been on this list for many
years.

. Optimization

  There are always ideas for new optimizations so as to speed up pattern
  matching. Most of them try to save work by recognizing a non-match without
  having to scan all the possibilities. These are some that I've recorded:

  * /((A{0,5}){0,5}){0,5}(something complex)/ on a non-matching string is very
    slow, though Perl is fast. Can we speed up somehow? Convert to {0,125}?
    OTOH, this is pathological - the user could easily fix it.

  * Turn ={4} into ==== ? (for speed). I once did an experiment, and it seems
    to have little effect, and maybe makes things worse.

  * "Ends with literal string" - note that a single character doesn't gain much
    over the existing "required code unit" feature that just remembers one code
    unit.

  * Remember an initial string rather than just 1 code unit.

  * A required code unit from alternatives - not just the last unit, but an
    earlier one if common to all alternatives.

  * Friedl contains other ideas.

  * The code does not set initial code unit flags for Unicode property types
    such as \p; I don't know how much benefit there would be for, for example,
    setting the bits for 0-9 and all values >= xC0 (in 8-bit mode) when a
    pattern starts with \p{N}.

. If Perl gets to a consistent state over the settings of capturing
  subpatterns inside repeats, see if we can match it. One example of the
  difference is the matching of /(main(O)?)+/ against mainOmain, where PCRE2
  leaves $2 set. In Perl, it's unset. Changing this in PCRE2 will be very hard
  because I think it needs much more state to be remembered.

. A feature to suspend a match via a callout was once requested.

. An option to convert results into character offsets and character lengths.

. A (non-Unix) user wanted pcre2grep options to (a) list a file name just
  once, preceded by a blank line, instead of adding it to every matched line,
  and (b) support --outputfile=name.

. Define a union for the results from pcre2_pattern_info().

. Provide a "random access to the subject" facility so that the way in which it
  is stored is independent of PCRE2. For efficiency, it probably isn't possible
  to switch this dynamically. It would have to be specified when PCRE2 was
  compiled. PCRE2 would then call a function every time it wanted a character.

. pcre2grep: add -rs for a sorted recurse. Having to store file names and sort
  them will of course slow it down.

. Someone suggested --disable-callout to save code space when callouts are
  never wanted. This seems rather marginal.

. A user suggested a parameter to limit the length of string matched, for
  example if the parameter is N, the current match should fail if the matched
  substring exceeds N. This could apply to both match functions. The value
  could be a new field in the match context. Compare the offset_limit feature,
  which limits where a match must start.

. Write a function that generates random matching strings for a compiled
  pattern.

. pcre2grep: an option to specify the output line separator, either as a string
  or select from a fixed list. This is not straightforward, because at the
  moment it outputs whatever is in the input file.

. Improve the code for duplicate checking in pcre2_dfa_match(). An incomplete,
  non-thread-safe patch showed that this can help performance for patterns
  where there are many alternatives. However, a simple thread-safe
  implementation that I tried made things worse in many simple cases, so this
  is not an obviously good thing.

. PCRE2 cannot at present distinguish between subpatterns with different names,
  but the same number (created by the use of ?|). In order to do so, a way of
  remembering *which* subpattern numbered n matched is needed. (*MARK) can
  perhaps be used as a way round this problem. However, note that Perl does not
  distinguish: like PCRE2, a name is just an alias for a number in Perl.

. Instead of having #ifdef HAVE_CONFIG_H in each module, put #include
  "something" so that the #ifdef appears only in one place, in "something".

. Implement something like (?(R2+)... to check outer recursions.

. If Perl ever supports the POSIX notation [[.something.]] PCRE2 should try
  to follow.

. A user wanted a way of ignoring all Unicode "mark" characters so that, for
  example "a" followed by an accent would, together, match "a". This can only
  be done clumsily at present by using a lookahead such as /(?=a)\X/, which
  works for "combining" characters.

. Perl supports [\N{x}-\N{y}] as a Unicode range, even in EBCDIC. PCRE2
  supports \N{U+dd..} everywhere, but not in EBCDIC.

. Unicode stuff from Perl:

    \b{gcb} or \b{g}    grapheme cluster boundary
    \b{sb}              sentence boundary
    \b{wb}              word boundary

  See Unicode TR 29. The last two are very much aimed at natural language.

. Allow a callout to specify a number of characters to skip. This can be done
  compatibly via an extra callout field.

. Allow callouts to return *PRUNE, *COMMIT, *THEN, *SKIP, with and without
  continuing (that is, with and without an implied *FAIL). A new option,
  PCRE2_CALLOUT_EXTENDED say, would be needed. This is unlikely ever to be
  implemented by JIT, so this could be an option for pcre2_match().

. A limit on substitutions: a user suggested somehow finding a way of making
  match_limit apply to the whole operation instead of each match separately.

. Some #defines could be replaced with enums to improve robustness.

. There was a request for an option for pcre2_match() to return the longest
  match. This would mean searching for all possible matches, of course.

. Perl's /a modifier sets Unicode, but restricts \d etc to ASCII characters,
  which is the PCRE2 default for PCRE2_UTF (use PCRE2_UCP to change). However,
  Perl also has /aa, which in addition, disables ASCII/non-ASCII caseless
  matching. Perhaps we need a new option PCRE2_CASELESS_RESTRICT_ASCII. In
  practice, this just means not using the ucd_caseless_sets[] table.

. There is more that could be done to the oss-fuzz setup (needs some research).
  A seed corpus could be built. I noted something about $LIB_FUZZING_ENGINE.
  The test function could make use of get_substrings() to cover more code.

. A neater way of handling recursion file names in pcre2grep, e.g. a single
  buffer that can grow. See also GitHub issue #2 (recursion looping via
  symlinks).

. A user suggested that before/after parameters in pcre2grep could have
  negative values, to list lines near to the matched line, but not necessarily
  the line itself. For example, --before-context=-1 would list the line *after*
  each matched line, without showing the matched line. The problem here is what
  to do with matches that are close together. Maybe a simpler way would be a
  flag to disable showing matched lines, only valid with either -A or -B?

. There was a suggestion for a pcre2grep colour default, or possibly a more
  general PCRE2GREP_OPT, but only for some options - not file names or patterns.

. Breaking loops that match an empty string: perhaps find a way of continuing
  if *something* has changed, but this might mean remembering additional data.
  "Something" could be a capture value, but then a list of previous values
  would be needed to avoid a cycle of changes.

. If a function could be written to find 3-character (or other length) fixed
  strings, at least one of which must be present for a match, efficient
  pre-searching of large datasets could be implemented.

. If pcre2grep had --first-line (match only in the first line) it could be
  efficiently used to find files "starting with xxx". What about --last-line?
  There was also the suggestion of an option for pcre2grep to scan only the
  start of a file. I am not keen - this is the job of "head".

. A user requested a means of determining whether a failed match was failed by
  the start-of-match optimizations, or by running the match engine. Easy enough
  to define a bit in the match data, but all three matchers would need work.

. Would inlining "simple" recursions provide a useful performance boost for the
  interpreters? JIT already does some of this, but it may not be worth it for
  the interpreters.

. Redesign handling of class/nclass/xclass because the compile code logic is
  currently very contorted and obscure. Also there was a request for a way of
  re-defining \w (and therefore \W, \b, and \B). An in-pattern sequence such as
  (?w=[...]) was suggested. Easiest way would be simply to inline the class,
  with lookarounds for \b and \B. Ideally the setting should last till the end
  of the group, which means remembering all previous settings; maybe a fixed
  amount of stack would do - how deep would anyone want to nest these things?
  See GitHub issue #13 for a compendium of character class issues, including
  (?[...]) extended classes.

. A user suggested something like --with-build-info to set a build information
  string that could be retrieved by pcre2_config(). However, there's no
  facility for a length limit in pcre2_config(), and what would be the
  encoding?

. Quantified groups with a fixed count currently operate by replicating the
  group in the compiled bytecode. This may not really matter in these days of
  gigabyte memory, but perhaps another implementation might be considered.
  Needs coordination between the interpreters and JIT.

. The POSIX interface is no longer POSIX compatible, because regoff_t is still
  defined as an int.

. The POSIX interface is not thread safe because it modifies a pcre2_match
  inside its regex_t while doing matching. A thread-safe version that uses
  a thread-local object has been proposed, but it would require at least C11
  compatibility.

. See also any suggestions in the GitHub issues.

Philip Hazel
Email local part: Philip.Hazel
Email domain: gmail.com
Last updated: 30 November 2023