MAINTENANCE README FOR PCRE2
============================

The files in the "maint" directory of the PCRE2 source contain data, scripts,
and programs that are used for the maintenance of PCRE2, but which do not form
part of the PCRE2 distribution tarballs. This document describes these files
and also contains some notes for maintainers. Its contents are:

  Files in the maint directory
  Updating to a new Unicode release
  Preparing for a PCRE2 release
  Updating version info for libtool
  Making a PCRE2 release
  Future ideas (wish list)


Files in the maint directory
============================

GenerateCommon.py
  A Python module containing data and functions that are used by the other
  Generate scripts.

GenerateTest26.py
  A Python script that generates input and expected output test data for test
  26, which tests certain aspects of Unicode property support.

GenerateUcd.py
  A Python script that generates the file pcre2_ucd.c from GenerateCommon.py
  and Unicode data files, which are themselves downloaded from the Unicode web
  site. The generated file contains the tables for a 2-stage lookup of Unicode
  properties, along with some auxiliary tables. The script starts with a long
  comment that gives details of the tables it constructs.

GenerateUcpHeader.py
  A Python script that generates the file pcre2_ucp.h from GenerateCommon.py
  and Unicode data files. The generated file defines constants for various
  Unicode property values.

GenerateUcpTables.py
  A Python script that generates the file pcre2_ucptables.c from
  GenerateCommon.py and Unicode data files. The generated file contains tables
  for looking up Unicode property names.

ManyConfigTests
  A shell script that runs "configure, make, test" a number of times with
  different configuration settings.

pcre2_chartables.c.non-standard
  This is a set of character tables that came from a Windows system. It has
  characters greater than 128 that are set as spaces, amongst other things. I
  kept it so that it can be used for testing from time to time.

README
  This file.

Unicode.tables
  The files in this directory were downloaded from the Unicode web site. They
  contain information about Unicode characters and scripts, and are used by the
  Generate scripts. There is also UnicodeData.txt, which is no longer used by
  any script but is kept because it is occasionally useful for manually looking
  up the details of certain characters. However, note that character names in
  this file such as "Arabic sign sanah" do NOT mean that the character is in a
  particular script (in this case, Arabic). Scripts.txt and
  ScriptExtensions.txt are where to look for script information.

ucptest.c
  A program for testing the Unicode property macros that do lookups in the
  pcre2_ucd.c data, mainly useful after rebuilding the Unicode property tables.
  Compile and run this in the "maint" directory (see comments at its head).
  This program can also be used to find characters with specific properties and
  to list which properties are supported.

ucptestdata
  A directory containing four files, testinput{1,2} and testoutput{1,2}, for
  use in conjunction with the ucptest program.

utf8.c
  A short, freestanding C program for converting a Unicode code point into a
  sequence of bytes in the UTF-8 encoding, and vice versa. If its argument is a
  hex number such as 0x1234, it outputs a list of the equivalent UTF-8 bytes.
  If its argument is a sequence of concatenated UTF-8 bytes (e.g. 12e188b4) it
  treats them as a UTF-8 string and outputs the equivalent code points in hex.
  See comments at its head for details.
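
As a rough illustration of the two conversions that utf8.c performs, here is a
short Python sketch. It is not the C program itself: the function names are
invented for this example, the output format is only an approximation, and it
handles standard UTF-8 only.

```python
def codepoint_to_utf8_hex(arg):
    """Turn a hex code point such as '0x1234' into its UTF-8 bytes in hex."""
    code_point = int(arg, 16)              # base 16 accepts the 0x prefix
    return chr(code_point).encode("utf-8").hex(" ")


def utf8_hex_to_codepoints(arg):
    """Turn concatenated hex bytes such as '12e188b4' into code points."""
    text = bytes.fromhex(arg).decode("utf-8")
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    print(codepoint_to_utf8_hex("0x1234"))      # e1 88 b4
    print(utf8_hex_to_codepoints("12e188b4"))   # ['U+0012', 'U+1234']
```

The second call shows why the example string in the utf8.c entry decodes to two
code points: the byte 12 is a complete one-byte character, and e1 88 b4 is the
three-byte encoding of U+1234.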


Updating to a new Unicode release
=================================

When there is a new release of Unicode, the files in Unicode.tables must be
refreshed from the web site. Once that is done, the four Python scripts that
generate files from the Unicode data can be run from within the "maint"
directory.

Note: Previously, it was necessary to update lists of scripts and their
abbreviations by hand before running the Python scripts. This is no longer
necessary because the scripts have been upgraded to extract this information
themselves. Also, there used to be explicit lists of scripts in two of the man
pages. This is no longer the case; the pcre2test program can now output a list
of supported scripts.

You can give an output file name as an argument to the following scripts, but
by default:

GenerateUcd.py        creates pcre2_ucd.c        )
GenerateUcpHeader.py  creates pcre2_ucp.h        ) in the current directory
GenerateUcpTables.py  creates pcre2_ucptables.c  )

These files can be compared against the existing versions in the src directory
to check on any changes before replacing the old files, but you can also
generate directly into the final location by running:

./GenerateUcd.py ../src/pcre2_ucd.c
./GenerateUcpHeader.py ../src/pcre2_ucp.h
./GenerateUcpTables.py ../src/pcre2_ucptables.c

Once the .c and .h files are in the ../src directory, the ucptest program can
be compiled and used to check that the new tables work properly. The data files
in ucptestdata are set up to check a number of test characters. See the
comments at the start of ucptest.c. If there are new scripts, adding a few
tests to the files in ucptestdata is a good idea.

Finally, you should run the GenerateTest26.py script to regenerate new versions
of the input and expected output from a series of Unicode property tests that
are automatically generated from the Unicode data files. By default, the files
are written to testinput26 and testoutput26 in the current directory, but you
can give an alternative directory name as an argument to the script. These
files should eventually be installed in the main testdata directory.


Preparing for a PCRE2 release
=============================

This section contains a checklist of things that I do before building a new
release.

. Ensure that the version number and version date are correct in configure.ac.

. Update the library version numbers in configure.ac according to the rules
  given below.

. If new build options or new source files have been added, ensure that they
  are added to the CMake files as well as to the autoconf files. The relevant
  files are CMakeLists.txt and config-cmake.h.in. After making a release, test
  it out with CMake if there have been changes here.

. Run ./autogen.sh to ensure everything is up-to-date.

. Compile and test with many different config options, and combinations of
  options. Also, test with valgrind by running "RunTest valgrind" and
  "RunGrepTest valgrind". The script maint/ManyConfigTests now encapsulates
  this testing. It runs tests with different configurations, and it also runs
  some of them with valgrind, all of which can take quite some time.

. Run tests in both 32-bit and 64-bit environments if possible. I can no longer
  run 32-bit tests.

. Run tests with two or more different compilers (e.g. clang and gcc), and
  make use of -fsanitize=address and friends where possible. For gcc,
  -fsanitize=undefined -std=gnu99 picks up undefined behaviour at runtime.
  For clang, -fsanitize=address,undefined,integer can be used but
  -fno-sanitize=unsigned-integer-overflow must be added when compiling with JIT.
  Another useful clang option is -fsanitize=signed-integer-overflow.

. Do a test build using CMake. Remove src/config.h first, lest it override the
  version that CMake creates. Also do a CMake unity build to check that it
  still works: [c]cmake -DCMAKE_UNITY_BUILD=ON sets up a unity build.

. Run perltest.sh on the test data for tests 1 and 4. The output should match
  the PCRE2 test output, apart from the version identification at the start of
  each test. Sometimes there are other differences in test 4 if PCRE2 and Perl
  are using different Unicode releases. The other tests are not Perl-compatible
  (they use various PCRE2-specific features or options).

. It is possible to test with the emulated memmove() function by undefining
  HAVE_MEMMOVE and HAVE_BCOPY in config.h, though I do not do this often.

. Documentation: check AUTHORS, ChangeLog (check version and date), LICENCE,
  NEWS (check version and date), NON-AUTOTOOLS-BUILD, and README. Many of these
  won't need changing, but over the long term things do change.

. I used to test new releases myself on a number of different operating
  systems. For example, on Solaris it is helpful to test using Sun's cc
  compiler as a change from gcc. Adding -xarch=v9 to the cc options does a
  64-bit test, but it also needs -S 64 for pcre2test to increase the stack size
  for test 2. Since I retired I can no longer do much of this. There are
  automated tests under Ubuntu, Alpine, and Windows that are now set up as
  GitHub actions. Check that they are running clean.

. The buildbots at http://buildfarm.opencsw.org/ do some automated testing
  of PCRE2 and should also be checked before putting out a release.


Updating version info for libtool
=================================

This set of rules for updating library version information came from a web page
whose URL I have forgotten. The version information consists of three parts:
(current, revision, age).

1. Start with version information of 0:0:0 for each libtool library.

2. Update the version information only immediately before a public release of
   your software. More frequent updates are unnecessary, and only guarantee
   that the current interface number gets larger faster.

3. If the library source code has changed at all since the last update, then
   increment revision; c:r:a becomes c:r+1:a.

4. If any interfaces have been added, removed, or changed since the last
   update, increment current, and set revision to 0.

5. If any interfaces have been added since the last public release, then
   increment age.

6. If any interfaces have been removed or changed since the last public
   release, then set age to 0.

The following explanation may help in understanding the above rules a bit
better. Consider that there are three possible kinds of reaction from users to
changes in a shared library:

1. Programs using the previous version may use the new version as a drop-in
   replacement, and programs using the new version can also work with the
   previous one. In other words, no recompiling or relinking is needed. In
   this case, increment revision only, don't touch current or age.

2. Programs using the previous version may use the new version as a drop-in
   replacement, but programs using the new version may use APIs not present in
   the previous one. In other words, a program linking against the new version
   may fail if linked against the old version at run time. In this case, set
   revision to 0, increment current and age.

3. Programs may need to be changed, recompiled, relinked in order to use the
   new version. Increment current, set revision and age to 0.


Making a PCRE2 release
======================

Run PrepareRelease and commit the files that it changes. The first thing this
script does is to run CheckMan on the man pages; if it finds any markup errors,
it reports them and then aborts. Otherwise it removes trailing spaces from
sources and refreshes the HTML documentation. Update the GitHub repository with
"git push".

Once PrepareRelease has run clean, run "make distcheck" to create the tarballs
and the zipball. I then sign these files. Double-check with "git status" that
the repository is fully up-to-date, then create a new tag and a release on
GitHub. Upload the tarballs, zipball, and the signatures as "assets" of the
GitHub release.

When the new release is out, don't forget to tell [email protected] and the
mailing list.
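
As an aside on the libtool rules in "Updating version info for libtool" above,
rules 3 to 6 can be sketched as a small Python function. This is only an
illustration, not part of the PCRE2 build: the function name, its parameters,
and the starting version 12:3:2 are all invented for the example, and it
assumes that any interface change also counts as a source change.

```python
def next_libtool_version(current, revision, age, *, source_changed,
                         interfaces_added=False,
                         interfaces_removed_or_changed=False):
    """Apply rules 3 to 6 in order; return the new (current, revision, age)."""
    interface_changed = interfaces_added or interfaces_removed_or_changed
    if source_changed or interface_changed:   # rule 3: any source change
        revision += 1
    if interface_changed:                     # rule 4: interface differs
        current += 1
        revision = 0
    if interfaces_added:                      # rule 5: new interfaces
        age += 1
    if interfaces_removed_or_changed:         # rule 6: compatibility broken
        age = 0
    return current, revision, age


if __name__ == "__main__":
    # Hypothetical starting point 12:3:2, one call per kind of change.
    print(next_libtool_version(12, 3, 2, source_changed=True))
    print(next_libtool_version(12, 3, 2, source_changed=True,
                               interfaces_added=True))
    print(next_libtool_version(12, 3, 2, source_changed=True,
                               interfaces_removed_or_changed=True))
```

The three calls correspond to the three kinds of user reaction listed in that
section: a bug-fix-only release gives 12:4:2, a release that only adds
interfaces gives 13:0:3, and an incompatible release gives 13:0:0.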


Future ideas (wish list)
========================

This section records a list of ideas so that they do not get forgotten. They
vary enormously in their usefulness and potential for implementation. Some are
very sensible; some are rather wacky. Some have been on this list for many
years.

. Optimization

  There are always ideas for new optimizations so as to speed up pattern
  matching. Most of them try to save work by recognizing a non-match without
  having to scan all the possibilities. These are some that I've recorded:

  * /((A{0,5}){0,5}){0,5}(something complex)/ on a non-matching string is very
    slow, though Perl is fast. Can we speed up somehow? Convert to {0,125}?
    OTOH, this is pathological - the user could easily fix it.

  * Turn ={4} into ==== ? (for speed). I once did an experiment, and it seems
    to have little effect, and maybe makes things worse.

  * "Ends with literal string" - note that a single character doesn't gain much
    over the existing "required code unit" feature that just remembers one code
    unit.

  * Remember an initial string rather than just 1 code unit.

  * A required code unit from alternatives - not just the last unit, but an
    earlier one if common to all alternatives.

  * Friedl contains other ideas.

  * The code does not set initial code unit flags for Unicode property types
    such as \p; I don't know how much benefit there would be in, for example,
    setting the bits for 0-9 and all values >= xC0 (in 8-bit mode) when a
    pattern starts with \p{N}.

. If Perl gets to a consistent state over the settings of capturing sub-
  patterns inside repeats, see if we can match it. One example of the
  difference is the matching of /(main(O)?)+/ against mainOmain, where PCRE2
  leaves $2 set. In Perl, it's unset. Changing this in PCRE2 will be very hard
  because I think it needs much more state to be remembered.

. A feature to suspend a match via a callout was once requested.

. An option to convert results into character offsets and character lengths.

. A (non-Unix) user wanted pcregrep options to (a) list a file name just once,
  preceded by a blank line, instead of adding it to every matched line, and (b)
  support --outputfile=name.

. Define a union for the results from pcre2_pattern_info().

. Provide a "random access to the subject" facility so that the way in which it
  is stored is independent of PCRE2. For efficiency, it probably isn't possible
  to switch this dynamically. It would have to be specified when PCRE2 was
  compiled. PCRE2 would then call a function every time it wanted a character.

. pcre2grep: add -rs for a sorted recurse. Having to store file names and sort
  them will of course slow it down.

. Someone suggested --disable-callout to save code space when callouts are
  never wanted. This seems rather marginal.

. A user suggested a parameter to limit the length of string matched, for
  example if the parameter is N, the current match should fail if the matched
  substring exceeds N. This could apply to both match functions. The value
  could be a new field in the match context. Compare the offset_limit feature,
  which limits where a match must start.

. Write a function that generates random matching strings for a compiled
  pattern.

. pcre2grep: an option to specify the output line separator, either as a string
  or select from a fixed list. This is not straightforward, because at the
  moment it outputs whatever is in the input file.

. Improve the code for duplicate checking in pcre2_dfa_match(). An incomplete,
  non-thread-safe patch showed that this can help performance for patterns
  where there are many alternatives. However, a simple thread-safe
  implementation that I tried made things worse in many simple cases, so this
  is not an obviously good thing.

. PCRE2 cannot at present distinguish between subpatterns with different names,
  but the same number (created by the use of ?|). In order to do so, a way of
  remembering *which* subpattern numbered n matched is needed. (*MARK) can
  perhaps be used as a way round this problem. However, note that Perl does not
  distinguish: like PCRE2, a name is just an alias for a number in Perl.

. Instead of having #ifdef HAVE_CONFIG_H in each module, put #include
  "something" and then the #ifdef appears only in one place, in "something".

. Implement something like (?(R2+)... to check outer recursions.

. If Perl ever supports the POSIX notation [[.something.]] PCRE2 should try
  to follow.

. A user wanted a way of ignoring all Unicode "mark" characters so that, for
  example "a" followed by an accent would, together, match "a". This can only
  be done clumsily at present by using a lookahead such as /(?=a)\X/, which
  works for "combining" characters.

. Perl supports [\N{x}-\N{y}] as a Unicode range, even in EBCDIC. PCRE2
  supports \N{U+dd..} everywhere, but not in EBCDIC.

. Unicode stuff from Perl:

    \b{gcb} or \b{g}    grapheme cluster boundary
    \b{sb}              sentence boundary
    \b{wb}              word boundary

  See Unicode TR 29. The last two are very much aimed at natural language.

. Allow a callout to specify a number of characters to skip. This can be done
  compatibly via an extra callout field.

. Allow callouts to return *PRUNE, *COMMIT, *THEN, *SKIP, with and without
  continuing (that is, with and without an implied *FAIL). A new option,
  PCRE2_CALLOUT_EXTENDED say, would be needed. This is unlikely ever to be
  implemented by JIT, so this could be an option for pcre2_match().

. A limit on substitutions: a user suggested somehow finding a way of making
  match_limit apply to the whole operation instead of each match separately.

. Some #defines could be replaced with enums to improve robustness.

. There was a request for an option for pcre2_match() to return the longest
  match. This would mean searching for all possible matches, of course.

. Perl's /a modifier sets Unicode, but restricts \d etc to ASCII characters,
  which is the PCRE2 default for PCRE2_UTF (use PCRE2_UCP to change). However,
  Perl also has /aa, which in addition, disables ASCII/non-ASCII caseless
  matching. Perhaps we need a new option PCRE2_CASELESS_RESTRICT_ASCII. In
  practice, this just means not using the ucd_caseless_sets[] table.

. There is more that could be done to the oss-fuzz setup (needs some research).
  A seed corpus could be built. I noted something about $LIB_FUZZING_ENGINE.
  The test function could make use of get_substrings() to cover more code.

. A neater way of handling recursion file names in pcre2grep, e.g. a single
  buffer that can grow. See also GitHub issue #2 (recursion looping via
  symlinks).

. A user suggested that before/after parameters in pcre2grep could have
  negative values, to list lines near to the matched line, but not necessarily
  the line itself. For example, --before-context=-1 would list the line *after*
  each matched line, without showing the matched line. The problem here is what
  to do with matches that are close together. Maybe a simpler way would be a
  flag to disable showing matched lines, only valid with either -A or -B?

. There was a suggestion for a pcre2grep colour default, or possibly a more
  general PCRE2GREP_OPT, but only for some options - not file names or patterns.

. Breaking loops that match an empty string: perhaps find a way of continuing
  if *something* has changed, but this might mean remembering additional data.
  "Something" could be a capture value, but then a list of previous values
  would be needed to avoid a cycle of changes.

. If a function could be written to find 3-character (or other length) fixed
  strings, at least one of which must be present for a match, efficient
  pre-searching of large datasets could be implemented.

. If pcre2grep had --first-line (match only in the first line) it could be
  efficiently used to find files "starting with xxx". What about --last-line?
  There was also the suggestion of an option for pcre2grep to scan only the
  start of a file. I am not keen - this is the job of "head".

. A user requested a means of determining whether a failed match was failed by
  the start-of-match optimizations, or by running the match engine. Easy enough
  to define a bit in the match data, but all three matchers would need work.

. Would inlining "simple" recursions provide a useful performance boost for the
  interpreters? JIT already does some of this, but it may not be worth it for
  the interpreters.

. Redesign handling of class/nclass/xclass because the compile code logic is
  currently very contorted and obscure. Also there was a request for a way of
  re-defining \w (and therefore \W, \b, and \B). An in-pattern sequence such as
  (?w=[...]) was suggested. Easiest way would be simply to inline the class,
  with lookarounds for \b and \B. Ideally the setting should last till the end
  of the group, which means remembering all previous settings; maybe a fixed
  amount of stack would do - how deep would anyone want to nest these things?
  See GitHub issue #13 for a compendium of character class issues, including
  (?[...]) extended classes.

. A user suggested something like --with-build-info to set a build information
  string that could be retrieved by pcre2_config(). However, there's no
  facility for a length limit in pcre2_config(), and what would be the
  encoding?

. Quantified groups with a fixed count currently operate by replicating the
  group in the compiled bytecode. This may not really matter in these days of
  gigabyte memory, but perhaps another implementation might be considered.
  Needs coordination between the interpreters and JIT.

. The POSIX interface is no longer POSIX compatible, because regoff_t is still
  defined as an int.

. The POSIX interface is not thread safe because it modifies a pcre2_match
  inside its regex_t while doing matching. A thread safe version that uses
  a thread local object has been proposed, but it would require the code to
  be compiled with at least C11 compatibility.

. See also any suggestions in the GitHub issues.

Philip Hazel
Email local part: Philip.Hazel
Email domain: gmail.com
Last updated: 30 November 2023