Update bundled pcrelib to 3.9.

# Tested under Linux only
Wez Furlong 2002-09-14 14:45:35 +00:00
parent 53b0623878
commit a2c6a6c186
38 changed files with 3267 additions and 697 deletions

@ -3,4 +3,4 @@ Written by: Philip Hazel <ph10@cam.ac.uk>
University of Cambridge Computing Service,
Cambridge, England. Phone: +44 1223 334714.
Copyright (c) 1997-2000 University of Cambridge
Copyright (c) 1997-2001 University of Cambridge

@ -9,7 +9,7 @@ Written by: Philip Hazel <ph10@cam.ac.uk>
University of Cambridge Computing Service,
Cambridge, England. Phone: +44 1223 334714.
Copyright (c) 1997-2000 University of Cambridge
Copyright (c) 1997-2001 University of Cambridge
Permission is granted to anyone to use this software for any purpose on any
computer system, and to redistribute it freely, subject to the following
@ -40,7 +40,11 @@ restrictions:
misrepresented as being the original software.
4. If PCRE is embedded in any software that is released under the GNU
General Purpose Licence (GPL), then the terms of that licence shall
supersede any condition above with which it is incompatible.
General Purpose Licence (GPL), or Lesser General Purpose Licence (LGPL),
then the terms of that licence shall supersede any condition above with
which it is incompatible.
The documentation for PCRE, supplied in the "doc" directory, is distributed
under the same terms as the software itself.
End

@ -1,6 +1,102 @@
ChangeLog for PCRE
------------------
Version 3.9 02-Jan-02
---------------------
1. A bit of extraneous text had somehow crept into the pcregrep documentation.
2. If --disable-static was given, the building process failed when trying to
build pcretest and pcregrep. (For some reason it was using libtool to compile
them, which is not right, as they aren't part of the library.)
Version 3.8 18-Dec-01
---------------------
1. The experimental UTF-8 code was completely screwed up. It was packing the
bytes in the wrong order. How dumb can you get?
Version 3.7 29-Oct-01
---------------------
1. In updating pcretest to check change 1 of version 3.6, I screwed up.
This caused pcretest, when used on the test data, to segfault. Unfortunately,
this didn't happen under Solaris 8, where I normally test things.
2. The Makefile had to be changed to make it work on BSD systems, where 'make'
doesn't seem to recognize that ./xxx and xxx are the same file. (This entry
isn't in ChangeLog distributed with 3.7 because I forgot when I hastily made
this fix an hour or so after the initial 3.7 release.)
Version 3.6 23-Oct-01
---------------------
1. Crashed with /(sens|respons)e and \1ibility/ and "sense and sensibility" if
offsets passed as NULL with zero offset count.
2. The config.guess and config.sub files had not been updated when I moved to
the latest autoconf.
Version 3.5 15-Aug-01
---------------------
1. Added some missing #if !defined NOPOSIX conditionals in pcretest.c that
had been forgotten.
2. By using declared but undefined structures, we can avoid using "void"
definitions in pcre.h while keeping the internal definitions of the structures
private.
3. The distribution is now built using autoconf 2.50 and libtool 1.4. From a
user point of view, this means that both static and shared libraries are built
by default, but this can be individually controlled. More of the work of
handling these static/shared cases is now inside libtool instead of PCRE's make
file.
4. The pcretest utility is now installed along with pcregrep because it is
useful for users (to test regexs) and by doing this, it automatically gets
relinked by libtool. The documentation has been turned into a man page, so
there are now .1, .txt, and .html versions in /doc.
5. Upgrades to pcregrep:
(i) Added long-form option names like gnu grep.
(ii) Added --help to list all options with an explanatory phrase.
(iii) Added -r, --recursive to recurse into sub-directories.
(iv) Added -f, --file to read patterns from a file.
6. pcre_exec() was referring to its "code" argument before testing that
argument for NULL (and giving an error if it was NULL).
7. Upgraded Makefile.in to allow for compiling in a different directory from
the source directory.
8. Tiny buglet in pcretest: when pcre_fullinfo() was called to retrieve the
options bits, the pointer it was passed was to an int instead of to an unsigned
long int. This mattered only on 64-bit systems.
9. Fixed typo (3.4/1) in pcre.h again. Sigh. I had changed pcre.h (which is
generated) instead of pcre.in, which is its source. Also made the same change
in several of the .c files.
10. A new release of gcc defines printf() as a macro, which broke pcretest
because it had an ifdef in the middle of a string argument for printf(). Fixed
by using separate calls to printf().
11. Added --enable-newline-is-cr and --enable-newline-is-lf to the configure
script, to force use of CR or LF instead of \n in the source. On non-Unix
systems, the value can be set in config.h.
12. The limit of 200 on non-capturing parentheses is a _nesting_ limit, not an
absolute limit. Changed the text of the error message to make this clear, and
likewise updated the man page.
13. The limit of 99 on the number of capturing subpatterns has been removed.
The new limit is 65535, which I hope will not be a "real" limit.
Version 3.4 22-Aug-00
---------------------

@ -9,7 +9,7 @@ Written by: Philip Hazel <ph10@cam.ac.uk>
University of Cambridge Computing Service,
Cambridge, England. Phone: +44 1223 334714.
Copyright (c) 1997-2000 University of Cambridge
Copyright (c) 1997-2001 University of Cambridge
Permission is granted to anyone to use this software for any purpose on any
computer system, and to redistribute it freely, subject to the following
@ -40,7 +40,11 @@ restrictions:
misrepresented as being the original software.
4. If PCRE is embedded in any software that is released under the GNU
General Purpose Licence (GPL), then the terms of that licence shall
supersede any condition above with which it is incompatible.
General Purpose Licence (GPL), or Lesser General Purpose Licence (LGPL),
then the terms of that licence shall supersede any condition above with
which it is incompatible.
The documentation for PCRE, supplied in the "doc" directory, is distributed
under the same terms as the software itself.
End

@ -1,6 +1,37 @@
News about PCRE releases
------------------------
Release 3.5 15-Aug-01
---------------------
1. The configuring system has been upgraded to use later versions of autoconf
and libtool. By default it builds both a shared and a static library if the OS
supports it. You can use --disable-shared or --disable-static on the configure
command if you want only one of them.
2. The pcretest utility is now installed along with pcregrep because it is
useful for users (to test regexs) and by doing this, it automatically gets
relinked by libtool. The documentation has been turned into a man page, so
there are now .1, .txt, and .html versions in /doc.
3. Upgrades to pcregrep:
(i) Added long-form option names like gnu grep.
(ii) Added --help to list all options with an explanatory phrase.
(iii) Added -r, --recursive to recurse into sub-directories.
(iv) Added -f, --file to read patterns from a file.
4. Added --enable-newline-is-cr and --enable-newline-is-lf to the configure
script, to force use of CR or LF instead of \n in the source. On non-Unix
systems, the value can be set in config.h.
5. The limit of 200 on non-capturing parentheses is a _nesting_ limit, not an
absolute limit. Changed the text of the error message to make this clear, and
likewise updated the man page.
6. The limit of 99 on the number of capturing subpatterns has been removed.
The new limit is 65535, which I hope will not be a "real" limit.
Release 3.3 01-Aug-00
---------------------

@ -9,7 +9,10 @@ commands to do the following:
(1) Copy or rename the file config.in as config.h, and change the macros that
define HAVE_STRERROR and HAVE_MEMMOVE to define them as 1 rather than 0.
Unfortunately, because of the way Unix autoconf works, the default setting has
to be 0.
to be 0. You may also want to make changes to other macros in config.h. In
particular, if you want to force a specific value for newline, you can define
the NEWLINE macro. The default is to use '\n', thereby using whatever value
your compiler gives to '\n'.
(2) Copy or rename the file pcre.in as pcre.h, and change the macro definitions
for PCRE_MAJOR, PCRE_MINOR, and PCRE_DATE near its start to the values set in

@ -17,14 +17,30 @@ that name by distributing it that way. To use it with an existing program that
uses the POSIX API, it will have to be renamed or pointed at by a link.
Contributions by users of PCRE
------------------------------
You can find contributions from PCRE users in the directory
ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/Contrib
where there is also a README file giving brief descriptions of what they are.
Several of them provide support for compiling PCRE on various flavours of
Windows systems (I myself do not use Windows). Some are complete in themselves;
others are pointers to URLs containing relevant files.
Building PCRE on a Unix system
------------------------------
To build PCRE on a Unix system, run the "configure" command in the PCRE
distribution directory. This is a standard GNU "autoconf" configuration script,
for which generic instructions are supplied in INSTALL. On many systems just
running "./configure" is sufficient, but the usual methods of changing standard
defaults are available. For example,
To build PCRE on a Unix system, first run the "configure" command from the PCRE
distribution directory, with your current directory set to the directory where
you want the files to be created. This command is a standard GNU "autoconf"
configuration script, for which generic instructions are supplied in INSTALL.
Most commonly, people build PCRE within its own distribution directory, and in
this case, on many systems, just running "./configure" is sufficient, but the
usual methods of changing standard defaults are available. For example,
CFLAGS='-O2 -Wall' ./configure --prefix=/opt/local
@ -32,14 +48,22 @@ specifies that the C compiler should be run with the flags '-O2 -Wall' instead
of the default, and that "make install" should install PCRE under /opt/local
instead of the default /usr/local.
If you want to build in a different directory, just run "configure" with that
directory as current. For example, suppose you have unpacked the PCRE source
into /source/pcre/pcre-xxx, but you want to build it in /build/pcre/pcre-xxx:
cd /build/pcre/pcre-xxx
/source/pcre/pcre-xxx/configure
If you want to make use of the experimental, incomplete support for UTF-8
character strings in PCRE, you must add --enable-utf8 to the "configure"
command. Without it, the code for handling UTF-8 is not included in the
library. (Even when included, it still has to be enabled by an option at run
time.)
The "configure" script builds four files:
The "configure" script builds five files:
. libtool is a script that builds shared and/or static libraries
. Makefile is built by copying Makefile.in and making substitutions.
. config.h is built by copying config.in and making substitutions.
. pcre-config is built by copying pcre-config.in and making substitutions.
@ -47,8 +71,9 @@ The "configure" script builds four files:
Once "configure" has run, you can run "make". It builds two libraries called
libpcre and libpcreposix, a test program called pcretest, and the pcregrep
command. You can use "make install" to copy these, and the public header file
pcre.h, to appropriate live directories on your system, in the normal way.
command. You can use "make install" to copy these, the public header files
pcre.h and pcreposix.h, and the man pages to appropriate live directories on
your system, in the normal way.
Running "make install" also installs the command pcre-config, which can be used
to recall information about the PCRE configuration and installation. For
@ -64,26 +89,38 @@ outputs information about where the library is installed. This command can be
included in makefiles for programs that use PCRE, saving the programmer from
having to remember too many details.
There is one esoteric feature that is controlled by "configure". It concerns
the character value used for "newline", and is something that you probably do
not want to change on a Unix system. The default is to use whatever value your
compiler gives to '\n'. By using --enable-newline-is-cr or
--enable-newline-is-lf you can force the value to be CR (13) or LF (10) if you
really want to.
Shared libraries on Unix systems
--------------------------------
The default distribution builds PCRE as two shared libraries. This support is
new and experimental and may not work on all systems. It relies on the
"libtool" scripts - these are distributed with PCRE. It should build a
"libtool" script and use this to compile and link shared libraries, which are
placed in a subdirectory called .libs. The programs pcretest and pcregrep are
built to use these uninstalled libraries by means of wrapper scripts. When you
use "make install" to install shared libraries, pcregrep and pcretest are
automatically re-built to use the newly installed libraries. However, only
pcregrep is installed, as pcretest is really just a test program.
The default distribution builds PCRE as two shared libraries and two static
libraries, as long as the operating system supports shared libraries. Shared
library support relies on the "libtool" script which is built as part of the
"configure" process.
To build PCRE using static libraries you must use --disable-shared when
The libtool script is used to compile and link both shared and static
libraries. They are placed in a subdirectory called .libs when they are newly
built. The programs pcretest and pcregrep are built to use these uninstalled
libraries (by means of wrapper scripts in the case of shared libraries). When
you use "make install" to install shared libraries, pcregrep and pcretest are
automatically re-built to use the newly installed shared libraries before being
installed themselves. However, the versions left in the source directory still
use the uninstalled libraries.
To build PCRE using static libraries only you must use --disable-shared when
configuring it. For example
./configure --prefix=/usr/gnu --disable-shared
Then run "make" in the usual way.
Then run "make" in the usual way. Similarly, you can use --disable-static to
build only shared libraries.
Building on non-Unix systems
@ -99,16 +136,16 @@ Standard C functions.
Testing PCRE
------------
To test PCRE on a Unix system, run the RunTest script in the pcre directory.
(This can also be run by "make runtest", "make check", or "make test".) For
other systems, see the instruction in NON-UNIX-USE.
To test PCRE on a Unix system, run the RunTest script that is created by the
configuring process. (This can also be run by "make runtest", "make check", or
"make test".) For other systems, see the instruction in NON-UNIX-USE.
The script runs the pcretest test program (which is documented in
doc/pcretest.txt) on each of the testinput files (in the testdata directory) in
turn, and compares the output with the contents of the corresponding testoutput
file. A file called testtry is used to hold the output from pcretest. To run
pcretest on just one of the test files, give its number as an argument to
RunTest, for example:
The script runs the pcretest test program (which is documented in the doc
directory) on each of the testinput files (in the testdata directory) in turn,
and compares the output with the contents of the corresponding testoutput file.
A file called testtry is used to hold the output from pcretest. To run pcretest
on just one of the test files, give its number as an argument to RunTest, for
example:
RunTest 3
@ -241,9 +278,9 @@ The distribution should contain the following files:
doc/pcregrep.html HTML version
doc/pcregrep.txt plain text version
install-sh a shell script for installing files
ltconfig ) files used to build "libtool",
ltmain.sh ) used only when building a shared library
pcretest.c test program
ltmain.sh file used to build a libtool script
pcretest.c comprehensive test program
pcredemo.c simple demonstration of coding calls to PCRE
perltest Perl test program
perltest8 Perl test program for UTF-8 tests
pcregrep.c source of a grep utility that uses PCRE
@ -267,4 +304,4 @@ The distribution should contain the following files:
pcre.def
Philip Hazel <ph10@cam.ac.uk>
August 2000
August 2001

@ -6,6 +6,7 @@
# Run PCRE tests
cf=diff
testdata=./testdata
# Select which tests to run; if no selection, run all
@ -29,7 +30,7 @@ while [ $# -gt 0 ] ; do
shift
done
if [ "" = "" ] ; then
if [ "-DSUPPORT_UTF8" = "" ] ; then
if [ $do5 = yes ] ; then
echo "Can't run test 5 because UTF8 support is not configured"
exit 1
@ -46,17 +47,17 @@ if [ $do1 = no -a $do2 = no -a $do3 = no -a $do4 = no -a\
do2=yes
do3=yes
do4=yes
if [ "" != "" ] ; then do5=yes; fi
if [ "" != "" ] ; then do6=yes; fi
if [ "-DSUPPORT_UTF8" != "" ] ; then do5=yes; fi
if [ "-DSUPPORT_UTF8" != "" ] ; then do6=yes; fi
fi
# Primary test, Perl-compatible
if [ $do1 = yes ] ; then
echo "Testing main functionality (Perl compatible)"
./pcretest testdata/testinput1 testtry
./pcretest $testdata/testinput1 testtry
if [ $? = 0 ] ; then
$cf testtry testdata/testoutput1
$cf testtry $testdata/testoutput1
if [ $? != 0 ] ; then exit 1; fi
else exit 1
fi
@ -66,9 +67,9 @@ fi
if [ $do2 = yes ] ; then
echo "Testing API and error handling (not Perl compatible)"
./pcretest -i testdata/testinput2 testtry
./pcretest -i $testdata/testinput2 testtry
if [ $? = 0 ] ; then
$cf testtry testdata/testoutput2
$cf testtry $testdata/testoutput2
if [ $? != 0 ] ; then exit 1; fi
else exit 1
fi
@ -78,9 +79,9 @@ fi
if [ $do3 = yes ] ; then
echo "Testing Perl 5.005 features (Perl 5.005 compatible)"
./pcretest testdata/testinput3 testtry
./pcretest $testdata/testinput3 testtry
if [ $? = 0 ] ; then
$cf testtry testdata/testoutput3
$cf testtry $testdata/testoutput3
if [ $? != 0 ] ; then exit 1; fi
else exit 1
fi
@ -98,9 +99,9 @@ if [ $do4 = yes ] ; then
locale -a | grep '^fr$' >/dev/null
if [ $? -eq 0 ] ; then
echo "Testing locale-specific features (using 'fr' locale)"
./pcretest testdata/testinput4 testtry
./pcretest $testdata/testinput4 testtry
if [ $? = 0 ] ; then
$cf testtry testdata/testoutput4
$cf testtry $testdata/testoutput4
if [ $? != 0 ] ; then
echo " "
echo "Locale test did not run entirely successfully."
@ -123,9 +124,9 @@ fi
if [ $do5 = yes ] ; then
echo "Testing experimental, incomplete UTF8 support (Perl compatible)"
./pcretest testdata/testinput5 testtry
./pcretest $testdata/testinput5 testtry
if [ $? = 0 ] ; then
$cf testtry testdata/testoutput5
$cf testtry $testdata/testoutput5
if [ $? != 0 ] ; then exit 1; fi
else exit 1
fi
@ -135,9 +136,9 @@ fi
if [ $do6 = yes ] ; then
echo "Testing API and internals for UTF8 support (not Perl compatible)"
./pcretest testdata/testinput6 testtry
./pcretest $testdata/testinput6 testtry
if [ $? = 0 ] ; then
$cf testtry testdata/testoutput6
$cf testtry $testdata/testoutput6
if [ $? != 0 ] ; then exit 1; fi
else exit 1
fi

@ -8,7 +8,7 @@ and semantics are as close as possible to those of the Perl 5 language.
Written by: Philip Hazel <ph10@cam.ac.uk>
Copyright (c) 1997-2000 University of Cambridge
Copyright (c) 1997-2001 University of Cambridge
-----------------------------------------------------------------------------
Permission is granted to anyone to use this software for any purpose on any
@ -53,7 +53,7 @@ order to be consistent. */
int main(void)
{
int i;
unsigned const char *tables = pcre_maketables();
const unsigned char *tables = pcre_maketables();
printf(
"/*************************************************\n"

@ -135,7 +135,7 @@ end of each byte.
Back references
---------------
OP_REF is followed by a single byte containing the reference number.
OP_REF is followed by two bytes containing the reference number.
Repeating character classes and back references
@ -163,11 +163,21 @@ Brackets and alternation
A pair of non-capturing (round) brackets is wrapped round each expression at
compile time, so alternation always happens in the context of brackets.
Non-capturing brackets use the opcode OP_BRA, while capturing brackets use
OP_BRA+1, OP_BRA+2, etc. [Note for North Americans: "bracket" to some English
speakers, including myself, can be round, square, curly, or pointy. Hence this
usage.]
Originally PCRE was limited to 99 capturing brackets (so as not to use up all
the opcodes). From release 3.5, there is no limit. What happens is that the
first ones, up to EXTRACT_BASIC_MAX are handled with separate opcodes, as
above. If there are more, the opcode is set to EXTRACT_BASIC_MAX+1, and the
first operation in the bracket is OP_BRANUMBER, followed by a 2-byte bracket
number. This opcode is ignored while matching, but is fished out when handling
the bracket itself. (They could have all been done like this, but I was making
minimal changes.)
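Reviewer's sketch (not part of the commit): the move from one-byte to two-byte reference and bracket numbers is what raises the limit from 99 to 65535, since two bytes can hold values up to 65535. The helpers below illustrate packing such a number into a byte stream. The names put2 and get2 are hypothetical, and the most-significant-byte-first order is an assumption about PCRE's layout rather than something this diff states.

```c
#include <stddef.h>

/* Hypothetical helper: store n (0..65535) in two bytes at p.
   Assumption: most significant byte first. */
static void put2(unsigned char *p, unsigned int n)
{
    p[0] = (unsigned char)((n >> 8) & 0xff);
    p[1] = (unsigned char)(n & 0xff);
}

/* Hypothetical helper: read back a number stored by put2(). */
static unsigned int get2(const unsigned char *p)
{
    return ((unsigned int)p[0] << 8) | (unsigned int)p[1];
}
```

With this layout, put2(buf, 258) leaves the bytes 1 and 2 in buf, and get2(buf) returns 258 again.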
A bracket opcode is followed by two bytes which give the offset to the next
alternative OP_ALT or, if there aren't any branches, to the matching KET
opcode. Each OP_ALT is followed by two bytes giving the offset to the next one,
@ -191,8 +201,8 @@ appropriate.
A subpattern with a bounded maximum repetition is replicated in a nested
fashion up to the maximum number of times, with BRAZERO or BRAMINZERO before
each replication after the minimum, so that, for example, (abc){2,5} is
compiled as (abc)(abc)((abc)((abc)(abc)?)?)?. The 200-bracket limit does not
apply to these internally generated brackets.
compiled as (abc)(abc)((abc)((abc)(abc)?)?)?. The 99 and 200 bracket limits do
not apply to these internally generated brackets.
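Reviewer's sketch (not part of the commit): the nested replication described above can be reproduced at the text level. The function below builds the expanded pattern string for a group repeated {min,max} times. It only illustrates the shape of the expansion; PCRE itself emits opcodes (BRAZERO or BRAMINZERO), not pattern text, and the names buf_append and expand_bounded are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Append s to out, which already holds *used characters.
   Returns -1 instead of overflowing the buffer. */
static int buf_append(char *out, size_t outsize, size_t *used, const char *s)
{
    size_t n = strlen(s);
    if (*used + n + 1 > outsize) return -1;
    memcpy(out + *used, s, n + 1);      /* copies the terminating NUL too */
    *used += n;
    return 0;
}

/* Hypothetical helper: write the expansion of (group){min,max} into out.
   Returns 0 on success, -1 on bad arguments or a too-small buffer. */
static int expand_bounded(const char *group, int min, int max,
                          char *out, size_t outsize)
{
    size_t used = 0;
    int optional = max - min;
    int i;

    if (min < 0 || max < min || outsize == 0) return -1;
    out[0] = '\0';

    /* Mandatory copies: (group) repeated min times. */
    for (i = 0; i < min; i++)
        if (buf_append(out, outsize, &used, "(") ||
            buf_append(out, outsize, &used, group) ||
            buf_append(out, outsize, &used, ")")) return -1;

    /* Every optional copy except the last opens a wrapper that holds
       one copy followed by the remaining optional copies. */
    for (i = 0; i < optional - 1; i++)
        if (buf_append(out, outsize, &used, "((") ||
            buf_append(out, outsize, &used, group) ||
            buf_append(out, outsize, &used, ")")) return -1;

    /* The innermost optional copy needs no wrapper. */
    if (optional > 0)
        if (buf_append(out, outsize, &used, "(") ||
            buf_append(out, outsize, &used, group) ||
            buf_append(out, outsize, &used, ")?")) return -1;

    /* Close the wrappers, marking each one optional. */
    for (i = 0; i < optional - 1; i++)
        if (buf_append(out, outsize, &used, ")?")) return -1;

    return 0;
}
```

Calling expand_bounded("abc", 2, 5, buf, sizeof buf) fills buf with (abc)(abc)((abc)((abc)(abc)?)?)?, the form quoted in the text.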
Assertions
@ -220,7 +230,7 @@ Conditional subpatterns
These are like other subpatterns, but they start with the opcode OP_COND. If
the condition is a back reference, this is stored at the start of the
subpattern using the opcode OP_CREF followed by one byte containing the
subpattern using the opcode OP_CREF followed by two bytes containing the
reference number. Otherwise, a conditional subpattern will always start with
one of the assertions.
@ -240,4 +250,4 @@ the compiled data.
Philip Hazel
August 2000
August 2001

@ -92,7 +92,9 @@ contain the major and minor release numbers for the library. Applications can
use these to include support for different releases.
The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR
are used for compiling and matching regular expressions.
are used for compiling and matching regular expressions. A sample program that
demonstrates the simplest way of using them is given in the file
\fIpcredemo.c\fR. The last section of this man page describes how to run it.
The functions \fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and
\fBpcre_get_substring_list()\fR are convenience functions for extracting
@ -129,18 +131,22 @@ the same compiled pattern can safely be used by several threads at once.
The function \fBpcre_compile()\fR is called to compile a pattern into an
internal form. The pattern is a C string terminated by a binary zero, and
is passed in the argument \fIpattern\fR. A pointer to a single block of memory
that is obtained via \fBpcre_malloc\fR is returned. This contains the
compiled code and related data. The \fBpcre\fR type is defined for this for
convenience, but in fact \fBpcre\fR is just a typedef for \fBvoid\fR, since the
contents of the block are not externally defined. It is up to the caller to
free the memory when it is no longer required.
.PP
that is obtained via \fBpcre_malloc\fR is returned. This contains the compiled
code and related data. The \fBpcre\fR type is defined for the returned block;
this is a typedef for a structure whose contents are not externally defined. It
is up to the caller to free the memory when it is no longer required.
Although the compiled code of a PCRE regex is relocatable, that is, it does not
depend on memory location, the complete \fBpcre\fR data block is not
fully relocatable, because it contains a copy of the \fItableptr\fR argument,
which is an address (see below).
The size of a compiled pattern is roughly proportional to the length of the
pattern string, except that each character class (other than those containing
just a single character, negated or not) requires 33 bytes, and repeat
quantifiers with a minimum greater than one or a bounded maximum cause the
relevant portions of the compiled pattern to be replicated.
.PP
The \fIoptions\fR argument contains independent bits that affect the
compilation. It should be zero if no options are required. Some of the options,
in particular, those that are compatible with Perl, can also be set and unset
@ -149,19 +155,31 @@ below). For these options, the contents of the \fIoptions\fR argument specifies
their initial settings at the start of compilation and execution. The
PCRE_ANCHORED option can be set at the time of matching as well as at compile
time.
.PP
If \fIerrptr\fR is NULL, \fBpcre_compile()\fR returns NULL immediately.
Otherwise, if compilation of a pattern fails, \fBpcre_compile()\fR returns
NULL, and sets the variable pointed to by \fIerrptr\fR to point to a textual
error message. The offset from the start of the pattern to the character where
the error was discovered is placed in the variable pointed to by
\fIerroffset\fR, which must not be NULL. If it is, an immediate error is given.
.PP
If the final argument, \fItableptr\fR, is NULL, PCRE uses a default set of
character tables which are built when it is compiled, using the default C
locale. Otherwise, \fItableptr\fR must be the result of a call to
\fBpcre_maketables()\fR. See the section on locale support below.
.PP
This code fragment shows a typical straightforward call to \fBpcre_compile()\fR:
pcre *re;
const char *error;
int erroffset;
re = pcre_compile(
"^A.*Z", /* the pattern */
0, /* default options */
&error, /* for error message */
&erroffset, /* for error offset */
NULL); /* use default character tables */
The following option bits are defined in the header file:
PCRE_ANCHORED
@ -248,10 +266,10 @@ Details of exactly what it entails are given below.
When a pattern is going to be used several times, it is worth spending more
time analyzing it in order to speed up the time taken for matching. The
function \fBpcre_study()\fR takes a pointer to a compiled pattern as its first
argument, and returns a pointer to a \fBpcre_extra\fR block (another \fBvoid\fR
typedef) containing additional information about the pattern; this can be
passed to \fBpcre_exec()\fR. If no additional information is available, NULL
is returned.
argument, and returns a pointer to a \fBpcre_extra\fR block (another typedef
for a structure with hidden contents) containing additional information about
the pattern; this can be passed to \fBpcre_exec()\fR. If no additional
information is available, NULL is returned.
The second argument contains option bits. At present, no options are defined
for \fBpcre_study()\fR, and this argument should always be zero.
@ -260,6 +278,14 @@ The third argument for \fBpcre_study()\fR is a pointer to an error message. If
studying succeeds (even if no data is returned), the variable it points to is
set to NULL. Otherwise it points to a textual error message.
This is a typical call to \fBpcre_study\fR():
pcre_extra *pe;
pe = pcre_study(
re, /* result of pcre_compile() */
0, /* no options exist */
&error); /* set to NULL or points to a message */
At present, studying a pattern is useful only for non-anchored patterns that do
not have a single fixed starting character. A bitmap of possible starting
characters is created.
@ -309,13 +335,24 @@ the following negative numbers:
PCRE_ERROR_BADMAGIC the "magic number" was not found
PCRE_ERROR_BADOPTION the value of \fIwhat\fR was invalid
Here is a typical call of \fBpcre_fullinfo()\fR, to obtain the length of the
compiled pattern:
int rc;
unsigned long int length;
rc = pcre_fullinfo(
re, /* result of pcre_compile() */
pe, /* result of pcre_study(), or NULL */
PCRE_INFO_SIZE, /* what is required */
&length); /* where to put the data */
The possible values for the third argument are defined in \fBpcre.h\fR, and are
as follows:
PCRE_INFO_OPTIONS
Return a copy of the options with which the pattern was compiled. The fourth
argument should point to au \fBunsigned long int\fR variable. These option bits
argument should point to an \fBunsigned long int\fR variable. These option bits
are those specified in the call to \fBpcre_compile()\fR, modified by any
top-level option settings within the pattern itself, and with the PCRE_ANCHORED
bit forcibly set if the form of the pattern implies that it can match only at
@ -396,6 +433,20 @@ pre-compiled pattern, which is passed in the \fIcode\fR argument. If the
pattern has been studied, the result of the study should be passed in the
\fIextra\fR argument. Otherwise this must be NULL.
Here is an example of a simple call to \fBpcre_exec()\fR:
int rc;
int ovector[30];
rc = pcre_exec(
re, /* result of pcre_compile() */
NULL, /* we didn't study the pattern */
"some string", /* the subject string */
11, /* the length of the subject string */
0, /* start at offset 0 in the subject */
0, /* default options */
ovector, /* vector for substring information */
30); /* number of elements in the vector */
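Reviewer's sketch (not part of the commit): after a call like the one above, the caller usually reads matched substrings out of ovector. The helper below assumes the layout the PCRE documentation describes elsewhere: a positive return value rc counts the offset pairs that were set, pair n occupies ovector[2*n] and ovector[2*n+1] as start and end offsets into the subject, and an unset substring has a negative start. The name capture_span is hypothetical, and the helper does not call PCRE itself.

```c
#include <stddef.h>

/* Hypothetical helper: locate captured substring n in subject.
   rc is the positive value returned by pcre_exec().
   On success, sets *start to the first character of the substring and
   returns its length; returns -1 if n is out of range or unset. */
static int capture_span(const char *subject, const int *ovector,
                        int rc, int n, const char **start)
{
    if (n < 0 || n >= rc) return -1;       /* no such pair was filled in */
    if (ovector[2 * n] < 0) return -1;     /* substring did not take part */
    *start = subject + ovector[2 * n];
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

Substring 0 is the whole match, so capture_span(subject, ovector, rc, 0, &p) gives the text matched by the entire pattern. The result is not NUL-terminated; print it with a precision, as in printf("%.*s", len, p).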
The PCRE_ANCHORED option can be passed in the \fIoptions\fR argument, whose
unused bits must be zero. However, if a pattern was compiled with
PCRE_ANCHORED, or turned out to be anchored by virtue of its contents, it
@ -437,9 +488,9 @@ below) and trying an ordinary match again.
The subject string is passed as a pointer in \fIsubject\fR, a length in
\fIlength\fR, and a starting offset in \fIstartoffset\fR. Unlike the pattern
string, it may contain binary zero characters. When the starting offset is
zero, the search for a match starts at the beginning of the subject, and this
is by far the most common case.
string, the subject may contain binary zero characters. When the starting
offset is zero, the search for a match starts at the beginning of the subject,
and this is by far the most common case.
A non-zero starting offset is useful when searching for another match in the
same subject by calling \fBpcre_exec()\fR again after a previous success.
@ -626,8 +677,9 @@ There are some size limitations in PCRE but it is hoped that they will never in
practice be relevant.
The maximum length of a compiled pattern is 65539 (sic) bytes.
All values in repeating quantifiers must be less than 65536.
The maximum number of capturing subpatterns is 99.
The maximum number of all parenthesized subpatterns, including capturing
The maximum number of capturing subpatterns is 65535.
There is no limit to the number of non-capturing subpatterns, but the maximum
depth of nesting of all kinds of parenthesized subpattern, including capturing
subpatterns, assertions, and other types of subpattern, is 200.
The maximum length of a subject string is the largest positive number that an
@ -949,7 +1001,7 @@ PCRE_MULTILINE is set.
Note that the sequences \\A, \\Z, and \\z can be used to match the start and
end of the subject in both modes, and if all branches of a pattern start with
\\A is it always anchored, whether PCRE_MULTILINE is set or not.
\\A it is always anchored, whether PCRE_MULTILINE is set or not.
.SH FULL STOP (PERIOD, DOT)
@ -1053,7 +1105,7 @@ negation, which is indicated by a ^ character after the colon. For example,
[12[:^digit:]]
matches "1", "2", or any non-digit. PCRE (and Perl) also recogize the POSIX
matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
supported, and an error is given if they are encountered.
@ -1151,7 +1203,7 @@ For example, if the string "the red king" is matched against the pattern
the ((red|white) (king|queen))
the captured substrings are "red king", "red", and "king", and are numbered 1,
2, and 3.
2, and 3, respectively.
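
The numbering can be made concrete with a small sketch that pulls each numbered substring out of a subject using pairs of offsets. The offsets in the usage note are written by hand to correspond to the "the red king" example; they are an assumption about what a match would store, not output captured from PCRE. The helper copy_capture is hypothetical and only loosely modelled on the idea behind pcre_copy_substring().

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not a PCRE function. Copy captured substring number n
   into buffer as a zero-terminated string. Pair n of the offsets vector holds
   the start and end of substring n; pair 0 is the whole match. Returns the
   substring length, or -1 if n is out of range, the group is unset (negative
   offset), or the buffer is too small. */
int copy_capture(const char *subject, const int *ovector, int count,
                 int n, char *buffer, int size)
{
    int start, length;

    if (n < 0 || n >= count) return -1;

    start = ovector[2*n];
    if (start < 0) return -1;              /* group did not take part */

    length = ovector[2*n + 1] - start;
    if (length < 0 || length + 1 > size) return -1;

    memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

With the subject "the red king" and the offsets {0,12, 4,12, 4,7, 8,12}, substring 1 comes out as "red king", 2 as "red", and 3 as "king", which is the order of the opening parentheses in the pattern.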
The fact that plain parentheses fulfil two functions is not always helpful.
There are often times when a grouping subpattern is required without a
@ -1792,6 +1844,137 @@ The following UTF-8 features of Perl 5.6 are not implemented:
2. The use of Unicode tables and properties and escapes \\p, \\P, and \\X.
.SH SAMPLE PROGRAM
The code below is a simple, complete demonstration program, to get you started
with using PCRE. This code is also supplied in the file \fIpcredemo.c\fR in the
PCRE distribution.
The program compiles the regular expression that is its first argument, and
matches it against the subject string in its second argument. No options are
set, and default character tables are used. If matching succeeds, the program
outputs the portion of the subject that matched, together with the contents of
any captured substrings.
On a Unix system that has PCRE installed in \fI/usr/local\fR, you can compile
the demonstration program using a command like this:
gcc -o pcredemo pcredemo.c -I/usr/local/include -L/usr/local/lib -lpcre
Then you can run simple tests like this:
./pcredemo 'cat|dog' 'the cat sat on the mat'
Note that there is a much more comprehensive test program, called
\fBpcretest\fR, which supports many more facilities for testing regular
expressions. The \fBpcredemo\fR program is provided as a simple coding example.
On some operating systems (e.g. Solaris) you may get an error like this when
you try to run \fBpcredemo\fR:
ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or directory
This is caused by the way shared library support works on those systems. You
need to add
-R/usr/local/lib
to the compile command to get round this problem. Here's the code:
#include <stdio.h>
#include <string.h>
#include <pcre.h>
#define OVECCOUNT 30 /* should be a multiple of 3 */
int main(int argc, char **argv)
{
pcre *re;
const char *error;
int erroffset;
int ovector[OVECCOUNT];
int rc, i;
if (argc != 3)
{
printf("Two arguments required: a regex and a "
"subject string\\n");
return 1;
}
/* Compile the regular expression in the first argument */
re = pcre_compile(
argv[1], /* the pattern */
0, /* default options */
&error, /* for error message */
&erroffset, /* for error offset */
NULL); /* use default character tables */
/* Compilation failed: print the error message and exit */
if (re == NULL)
{
printf("PCRE compilation failed at offset %d: %s\\n",
erroffset, error);
return 1;
}
/* Compilation succeeded: match the subject in the second
argument */
rc = pcre_exec(
re, /* the compiled pattern */
NULL, /* we didn't study the pattern */
argv[2], /* the subject string */
(int)strlen(argv[2]), /* the length of the subject */
0, /* start at offset 0 in the subject */
0, /* default options */
ovector, /* vector for substring information */
OVECCOUNT); /* number of elements in the vector */
/* Matching failed: handle error cases */
if (rc < 0)
{
switch(rc)
{
case PCRE_ERROR_NOMATCH: printf("No match\\n"); break;
/*
Handle other special cases if you like
*/
default: printf("Matching error %d\\n", rc); break;
}
return 1;
}
/* Match succeeded */
printf("Match succeeded\\n");
/* The output vector wasn't big enough */
if (rc == 0)
{
rc = OVECCOUNT/3;
printf("ovector only has room for %d captured "
"substrings\\n", rc - 1);
}
/* Show substrings stored in the output vector */
for (i = 0; i < rc; i++)
{
char *substring_start = argv[2] + ovector[2*i];
int substring_length = ovector[2*i+1] - ovector[2*i];
printf("%2d: %.*s\\n", i, substring_length,
substring_start);
}
return 0;
}
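
The sizing arithmetic used by the program (a vector whose length is a multiple of 3, and OVECCOUNT/3 slots when the vector overflows) can be sketched as two small helpers. The function names are hypothetical; the arithmetic is taken only from the comments and code of the demonstration program above.

```c
#include <assert.h>

/* Hypothetical helper: number of int elements to allocate so that the whole
   match plus capture_groups captured substrings can be reported. Three
   elements are reserved per reported string, as in OVECCOUNT above. */
int ovector_size_for(int capture_groups)
{
    assert(capture_groups >= 0);
    return 3 * (capture_groups + 1);
}

/* Hypothetical helper: how many captured substrings (not counting the whole
   match) a vector of the given size has room for. This mirrors the
   "rc = OVECCOUNT/3; ... rc - 1" calculation in the program. */
int captures_that_fit(int ovector_size)
{
    assert(ovector_size >= 3);
    return ovector_size / 3 - 1;
}
```

For the program's OVECCOUNT of 30 this gives room for 9 captured substrings, and 9 capture groups in turn call for 30 elements.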
.SH AUTHOR
Philip Hazel <ph10@cam.ac.uk>
.br
@ -1803,8 +1986,6 @@ Cambridge CB2 3QG, England.
.br
Phone: +44 1223 334714
Last updated: 28 August 2000,
Last updated: 15 August 2001
.br
the 250th anniversary of the death of J.S. Bach.
.br
Copyright (c) 1997-2000 University of Cambridge.
Copyright (c) 1997-2001 University of Cambridge.

View File

@ -38,7 +38,8 @@ conversion went wrong.
<LI><A NAME="TOC28" HREF="#SEC28">RECURSIVE PATTERNS</A>
<LI><A NAME="TOC29" HREF="#SEC29">PERFORMANCE</A>
<LI><A NAME="TOC30" HREF="#SEC30">UTF-8 SUPPORT</A>
<LI><A NAME="TOC31" HREF="#SEC31">AUTHOR</A>
<LI><A NAME="TOC31" HREF="#SEC31">SAMPLE PROGRAM</A>
<LI><A NAME="TOC32" HREF="#SEC32">AUTHOR</A>
</UL>
<LI><A NAME="SEC1" HREF="#TOC1">NAME</A>
<P>
@ -126,7 +127,9 @@ use these to include support for different releases.
</P>
<P>
The functions <B>pcre_compile()</B>, <B>pcre_study()</B>, and <B>pcre_exec()</B>
are used for compiling and matching regular expressions.
are used for compiling and matching regular expressions. A sample program that
demonstrates the simplest way of using them is given in the file
<I>pcredemo.c</I>. The last section of this man page describes how to run it.
</P>
<P>
The functions <B>pcre_copy_substring()</B>, <B>pcre_get_substring()</B>, and
@ -168,11 +171,16 @@ the same compiled pattern can safely be used by several threads at once.
The function <B>pcre_compile()</B> is called to compile a pattern into an
internal form. The pattern is a C string terminated by a binary zero, and
is passed in the argument <I>pattern</I>. A pointer to a single block of memory
that is obtained via <B>pcre_malloc</B> is returned. This contains the
compiled code and related data. The <B>pcre</B> type is defined for this for
convenience, but in fact <B>pcre</B> is just a typedef for <B>void</B>, since the
contents of the block are not externally defined. It is up to the caller to
free the memory when it is no longer required.
that is obtained via <B>pcre_malloc</B> is returned. This contains the compiled
code and related data. The <B>pcre</B> type is defined for the returned block;
this is a typedef for a structure whose contents are not externally defined. It
is up to the caller to free the memory when it is no longer required.
</P>
<P>
Although the compiled code of a PCRE regex is relocatable, that is, it does not
depend on memory location, the complete <B>pcre</B> data block is not
fully relocatable, because it contains a copy of the <I>tableptr</I> argument,
which is an address (see below).
</P>
<P>
The size of a compiled pattern is roughly proportional to the length of the
@ -206,6 +214,22 @@ locale. Otherwise, <I>tableptr</I> must be the result of a call to
<B>pcre_maketables()</B>. See the section on locale support below.
</P>
<P>
This code fragment shows a typical straightforward call to <B>pcre_compile()</B>:
</P>
<P>
<PRE>
pcre *re;
const char *error;
int erroffset;
re = pcre_compile(
"^A.*Z", /* the pattern */
0, /* default options */
&error, /* for error message */
&erroffset, /* for error offset */
NULL); /* use default character tables */
</PRE>
</P>
<P>
The following option bits are defined in the header file:
</P>
<P>
@ -329,10 +353,10 @@ Details of exactly what it entails are given below.
When a pattern is going to be used several times, it is worth spending more
time analyzing it in order to speed up the time taken for matching. The
function <B>pcre_study()</B> takes a pointer to a compiled pattern as its first
argument, and returns a pointer to a <B>pcre_extra</B> block (another <B>void</B>
typedef) containing additional information about the pattern; this can be
passed to <B>pcre_exec()</B>. If no additional information is available, NULL
is returned.
argument, and returns a pointer to a <B>pcre_extra</B> block (another typedef
for a structure with hidden contents) containing additional information about
the pattern; this can be passed to <B>pcre_exec()</B>. If no additional
information is available, NULL is returned.
</P>
<P>
The second argument contains option bits. At present, no options are defined
@ -344,6 +368,18 @@ studying succeeds (even if no data is returned), the variable it points to is
set to NULL. Otherwise it points to a textual error message.
</P>
<P>
This is a typical call to <B>pcre_study()</B>:
</P>
<P>
<PRE>
pcre_extra *pe;
pe = pcre_study(
re, /* result of pcre_compile() */
0, /* no options exist */
&error); /* set to NULL or points to a message */
</PRE>
</P>
<P>
At present, studying a pattern is useful only for non-anchored patterns that do
not have a single fixed starting character. A bitmap of possible starting
characters is created.
@ -403,6 +439,21 @@ the following negative numbers:
</PRE>
</P>
<P>
Here is a typical call of <B>pcre_fullinfo()</B>, to obtain the length of the
compiled pattern:
</P>
<P>
<PRE>
int rc;
unsigned long int length;
rc = pcre_fullinfo(
re, /* result of pcre_compile() */
pe, /* result of pcre_study(), or NULL */
PCRE_INFO_SIZE, /* what is required */
&length); /* where to put the data */
</PRE>
</P>
<P>
The possible values for the third argument are defined in <B>pcre.h</B>, and are
as follows:
</P>
@ -413,7 +464,7 @@ as follows:
</P>
<P>
Return a copy of the options with which the pattern was compiled. The fourth
argument should point to au <B>unsigned long int</B> variable. These option bits
argument should point to an <B>unsigned long int</B> variable. These option bits
are those specified in the call to <B>pcre_compile()</B>, modified by any
top-level option settings within the pattern itself, and with the PCRE_ANCHORED
bit forcibly set if the form of the pattern implies that it can match only at
@ -528,6 +579,24 @@ pattern has been studied, the result of the study should be passed in the
<I>extra</I> argument. Otherwise this must be NULL.
</P>
<P>
Here is an example of a simple call to <B>pcre_exec()</B>:
</P>
<P>
<PRE>
int rc;
int ovector[30];
rc = pcre_exec(
re, /* result of pcre_compile() */
NULL, /* we didn't study the pattern */
"some string", /* the subject string */
11, /* the length of the subject string */
0, /* start at offset 0 in the subject */
0, /* default options */
ovector, /* vector for substring information */
30); /* number of elements in the vector */
</PRE>
</P>
<P>
The PCRE_ANCHORED option can be passed in the <I>options</I> argument, whose
unused bits must be zero. However, if a pattern was compiled with
PCRE_ANCHORED, or turned out to be anchored by virtue of its contents, it
@ -588,9 +657,9 @@ below) and trying an ordinary match again.
<P>
The subject string is passed as a pointer in <I>subject</I>, a length in
<I>length</I>, and a starting offset in <I>startoffset</I>. Unlike the pattern
string, it may contain binary zero characters. When the starting offset is
zero, the search for a match starts at the beginning of the subject, and this
is by far the most common case.
string, the subject may contain binary zero characters. When the starting
offset is zero, the search for a match starts at the beginning of the subject,
and this is by far the most common case.
</P>
<P>
A non-zero starting offset is useful when searching for another match in the
@ -833,8 +902,9 @@ There are some size limitations in PCRE but it is hoped that they will never in
practice be relevant.
The maximum length of a compiled pattern is 65539 (sic) bytes.
All values in repeating quantifiers must be less than 65536.
The maximum number of capturing subpatterns is 99.
The maximum number of all parenthesized subpatterns, including capturing
The maximum number of capturing subpatterns is 65535.
There is no limit to the number of non-capturing subpatterns, but the maximum
depth of nesting of all kinds of parenthesized subpattern, including capturing
subpatterns, assertions, and other types of subpattern, is 200.
</P>
<P>
@ -1225,7 +1295,7 @@ PCRE_MULTILINE is set.
<P>
Note that the sequences \A, \Z, and \z can be used to match the start and
end of the subject in both modes, and if all branches of a pattern start with
\A is it always anchored, whether PCRE_MULTILINE is set or not.
\A it is always anchored, whether PCRE_MULTILINE is set or not.
</P>
<LI><A NAME="SEC16" HREF="#TOC1">FULL STOP (PERIOD, DOT)</A>
<P>
@ -1350,7 +1420,7 @@ negation, which is indicated by a ^ character after the colon. For example,
</PRE>
</P>
<P>
matches "1", "2", or any non-digit. PCRE (and Perl) also recogize the POSIX
matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
supported, and an error is given if they are encountered.
</P>
@ -1482,7 +1552,7 @@ For example, if the string "the red king" is matched against the pattern
</P>
<P>
the captured substrings are "red king", "red", and "king", and are numbered 1,
2, and 3.
2, and 3, respectively.
</P>
<P>
The fact that plain parentheses fulfil two functions is not always helpful.
@ -2375,7 +2445,213 @@ The following UTF-8 features of Perl 5.6 are not implemented:
<P>
2. The use of Unicode tables and properties and escapes \p, \P, and \X.
</P>
<LI><A NAME="SEC31" HREF="#TOC1">AUTHOR</A>
<LI><A NAME="SEC31" HREF="#TOC1">SAMPLE PROGRAM</A>
<P>
The code below is a simple, complete demonstration program, to get you started
with using PCRE. This code is also supplied in the file <I>pcredemo.c</I> in the
PCRE distribution.
</P>
<P>
The program compiles the regular expression that is its first argument, and
matches it against the subject string in its second argument. No options are
set, and default character tables are used. If matching succeeds, the program
outputs the portion of the subject that matched, together with the contents of
any captured substrings.
</P>
<P>
On a Unix system that has PCRE installed in <I>/usr/local</I>, you can compile
the demonstration program using a command like this:
</P>
<P>
<PRE>
gcc -o pcredemo pcredemo.c -I/usr/local/include -L/usr/local/lib -lpcre
</PRE>
</P>
<P>
Then you can run simple tests like this:
</P>
<P>
<PRE>
./pcredemo 'cat|dog' 'the cat sat on the mat'
</PRE>
</P>
<P>
Note that there is a much more comprehensive test program, called
<B>pcretest</B>, which supports many more facilities for testing regular
expressions. The <B>pcredemo</B> program is provided as a simple coding example.
</P>
<P>
On some operating systems (e.g. Solaris) you may get an error like this when
you try to run <B>pcredemo</B>:
</P>
<P>
<PRE>
ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or directory
</PRE>
</P>
<P>
This is caused by the way shared library support works on those systems. You
need to add
</P>
<P>
<PRE>
-R/usr/local/lib
</PRE>
</P>
<P>
to the compile command to get round this problem. Here's the code:
</P>
<P>
<PRE>
#include &#60;stdio.h&#62;
#include &#60;string.h&#62;
#include &#60;pcre.h&#62;
</PRE>
</P>
<P>
<PRE>
#define OVECCOUNT 30 /* should be a multiple of 3 */
</PRE>
</P>
<P>
<PRE>
int main(int argc, char **argv)
{
pcre *re;
const char *error;
int erroffset;
int ovector[OVECCOUNT];
int rc, i;
</PRE>
</P>
<P>
<PRE>
if (argc != 3)
{
printf("Two arguments required: a regex and a "
"subject string\n");
return 1;
}
</PRE>
</P>
<P>
<PRE>
/* Compile the regular expression in the first argument */
</PRE>
</P>
<P>
<PRE>
re = pcre_compile(
argv[1], /* the pattern */
0, /* default options */
&error, /* for error message */
&erroffset, /* for error offset */
NULL); /* use default character tables */
</PRE>
</P>
<P>
<PRE>
/* Compilation failed: print the error message and exit */
</PRE>
</P>
<P>
<PRE>
if (re == NULL)
{
printf("PCRE compilation failed at offset %d: %s\n",
erroffset, error);
return 1;
}
</PRE>
</P>
<P>
<PRE>
/* Compilation succeeded: match the subject in the second
argument */
</PRE>
</P>
<P>
<PRE>
rc = pcre_exec(
re, /* the compiled pattern */
NULL, /* we didn't study the pattern */
argv[2], /* the subject string */
(int)strlen(argv[2]), /* the length of the subject */
0, /* start at offset 0 in the subject */
0, /* default options */
ovector, /* vector for substring information */
OVECCOUNT); /* number of elements in the vector */
</PRE>
</P>
<P>
<PRE>
/* Matching failed: handle error cases */
</PRE>
</P>
<P>
<PRE>
if (rc &#60; 0)
{
switch(rc)
{
case PCRE_ERROR_NOMATCH: printf("No match\n"); break;
/*
Handle other special cases if you like
*/
default: printf("Matching error %d\n", rc); break;
}
return 1;
}
</PRE>
</P>
<P>
<PRE>
/* Match succeeded */
</PRE>
</P>
<P>
<PRE>
printf("Match succeeded\n");
</PRE>
</P>
<P>
<PRE>
/* The output vector wasn't big enough */
</PRE>
</P>
<P>
<PRE>
if (rc == 0)
{
rc = OVECCOUNT/3;
printf("ovector only has room for %d captured "
"substrings\n", rc - 1);
}
</PRE>
</P>
<P>
<PRE>
/* Show substrings stored in the output vector */
</PRE>
</P>
<P>
<PRE>
for (i = 0; i &#60; rc; i++)
{
char *substring_start = argv[2] + ovector[2*i];
int substring_length = ovector[2*i+1] - ovector[2*i];
printf("%2d: %.*s\n", i, substring_length,
substring_start);
}
</PRE>
</P>
<P>
<PRE>
return 0;
}
</PRE>
</P>
<LI><A NAME="SEC32" HREF="#TOC1">AUTHOR</A>
<P>
Philip Hazel &#60;ph10@cam.ac.uk&#62;
<BR>
@ -2388,10 +2664,6 @@ Cambridge CB2 3QG, England.
Phone: +44 1223 334714
</P>
<P>
Last updated: 28 August 2000,
Last updated: 15 August 2001
<BR>
<PRE>
the 250th anniversary of the death of J.S. Bach.
<BR>
</PRE>
Copyright (c) 1997-2000 University of Cambridge.
Copyright (c) 1997-2001 University of Cambridge.

View File

@ -74,7 +74,10 @@ DESCRIPTION
releases.
The functions pcre_compile(), pcre_study(), and pcre_exec()
are used for compiling and matching regular expressions.
are used for compiling and matching regular expressions. A
sample program that demonstrates the simplest way of using
them is given in the file pcredemo.c. The last section of
this man page describes how to run it.
The functions pcre_copy_substring(), pcre_get_substring(),
and pcre_get_substring_list() are convenience functions for
@ -104,19 +107,10 @@ DESCRIPTION
MULTI-THREADING
The PCRE functions can be used in multi-threading
SunOS 5.8 Last change: 2
applications, with the proviso that the memory management
functions pointed to by pcre_malloc and pcre_free are shared
by all threads.
The PCRE functions can be used in multi-threading applica-
tions, with the proviso that the memory management functions
pointed to by pcre_malloc and pcre_free are shared by all
threads.
The compiled form of a regular expression is not altered
during matching, so the same compiled pattern can safely be
@ -130,11 +124,16 @@ COMPILING A PATTERN
by a binary zero, and is passed in the argument pattern. A
pointer to a single block of memory that is obtained via
pcre_malloc is returned. This contains the compiled code and
related data. The pcre type is defined for this for conveni-
ence, but in fact pcre is just a typedef for void, since the
contents of the block are not externally defined. It is up
to the caller to free the memory when it is no longer
required.
related data. The pcre type is defined for the returned
block; this is a typedef for a structure whose contents are
not externally defined. It is up to the caller to free the
memory when it is no longer required.
Although the compiled code of a PCRE regex is relocatable,
that is, it does not depend on memory location, the complete
pcre data block is not fully relocatable, because it con-
tains a copy of the tableptr argument, which is an address
(see below).
The size of a compiled pattern is roughly proportional to
the length of the pattern string, except that each character
@ -169,6 +168,19 @@ COMPILING A PATTERN
must be the result of a call to pcre_maketables(). See the
section on locale support below.
This code fragment shows a typical straightforward call to
pcre_compile():
pcre *re;
const char *error;
int erroffset;
re = pcre_compile(
"^A.*Z", /* the pattern */
0, /* default options */
&error, /* for error message */
&erroffset, /* for error offset */
NULL); /* use default character tables */
The following option bits are defined in the header file:
PCRE_ANCHORED
@ -271,12 +283,12 @@ STUDYING A PATTERN
When a pattern is going to be used several times, it is
worth spending more time analyzing it in order to speed up
the time taken for matching. The function pcre_study() takes
a pointer to a compiled pattern as its first argument, and
returns a pointer to a pcre_extra block (another void
typedef) containing additional information about the pat-
tern; this can be passed to pcre_exec(). If no additional
information is available, NULL is returned.
returns a pointer to a pcre_extra block (another typedef for
a structure with hidden contents) containing additional
information about the pattern; this can be passed to
pcre_exec(). If no additional information is available, NULL
is returned.
The second argument contains option bits. At present, no
options are defined for pcre_study(), and this argument
@ -287,6 +299,14 @@ STUDYING A PATTERN
the variable it points to is set to NULL. Otherwise it
points to a textual error message.
This is a typical call to pcre_study():
pcre_extra *pe;
pe = pcre_study(
re, /* result of pcre_compile() */
0, /* no options exist */
&error); /* set to NULL or points to a message */
At present, studying a pattern is useful only for non-
anchored patterns that do not have a single fixed starting
character. A bitmap of possible starting characters is
@ -347,13 +367,24 @@ INFORMATION ABOUT A PATTERN
PCRE_ERROR_BADMAGIC the "magic number" was not found
PCRE_ERROR_BADOPTION the value of what was invalid
Here is a typical call of pcre_fullinfo(), to obtain the
length of the compiled pattern:
int rc;
unsigned long int length;
rc = pcre_fullinfo(
re, /* result of pcre_compile() */
pe, /* result of pcre_study(), or NULL */
PCRE_INFO_SIZE, /* what is required */
&length); /* where to put the data */
The possible values for the third argument are defined in
pcre.h, and are as follows:
PCRE_INFO_OPTIONS
Return a copy of the options with which the pattern was com-
piled. The fourth argument should point to au unsigned long
piled. The fourth argument should point to an unsigned long
int variable. These option bits are those specified in the
call to pcre_compile(), modified by any top-level option
settings within the pattern itself, and with the
@ -375,9 +406,9 @@ INFORMATION ABOUT A PATTERN
PCRE_INFO_BACKREFMAX
Return the number of the highest back reference in the
pattern. The fourth argument should point to an int vari-
able. Zero is returned if there are no back references.
Return the number of the highest back reference in the pat-
tern. The fourth argument should point to an int variable.
Zero is returned if there are no back references.
PCRE_INFO_FIRSTCHAR
@ -440,11 +471,34 @@ INFORMATION ABOUT A PATTERN
MATCHING A PATTERN
The function pcre_exec() is called to match a subject string
SunOS 5.8 Last change: 9
against a pre-compiled pattern, which is passed in the code
argument. If the pattern has been studied, the result of the
study should be passed in the extra argument. Otherwise this
must be NULL.
Here is an example of a simple call to pcre_exec():
int rc;
int ovector[30];
rc = pcre_exec(
re, /* result of pcre_compile() */
NULL, /* we didn't study the pattern */
"some string", /* the subject string */
11, /* the length of the subject string */
0, /* start at offset 0 in the subject */
0, /* default options */
ovector, /* vector for substring information */
30); /* number of elements in the vector */
The PCRE_ANCHORED option can be passed in the options argu-
ment, whose unused bits must be zero. However, if a pattern
was compiled with PCRE_ANCHORED, or turned out to be
@ -495,10 +549,10 @@ MATCHING A PATTERN
The subject string is passed as a pointer in subject, a
length in length, and a starting offset in startoffset.
Unlike the pattern string, it may contain binary zero char-
acters. When the starting offset is zero, the search for a
match starts at the beginning of the subject, and this is by
far the most common case.
Unlike the pattern string, the subject may contain binary
zero characters. When the starting offset is zero, the
search for a match starts at the beginning of the subject,
and this is by far the most common case.
A non-zero starting offset is useful when searching for
another match in the same subject by calling pcre_exec()
@ -634,17 +688,9 @@ MATCHING A PATTERN
EXTRACTING CAPTURED SUBSTRINGS
Captured substrings can be accessed directly by using the
SunOS 5.8 Last change: 12
offsets returned by pcre_exec() in ovector. For convenience,
the functions pcre_copy_substring(), pcre_get_substring(),
and pcre_get_substring_list() are provided for extracting
@ -722,10 +768,12 @@ LIMITATIONS
There are some size limitations in PCRE but it is hoped that
they will never in practice be relevant. The maximum length
of a compiled pattern is 65539 (sic) bytes. All values in
repeating quantifiers must be less than 65536. The maximum
number of capturing subpatterns is 99. The maximum number
of all parenthesized subpatterns, including capturing sub-
patterns, assertions, and other types of subpattern, is 200.
repeating quantifiers must be less than 65536. The max-
imum number of capturing subpatterns is 65535. There is no
limit to the number of non-capturing subpatterns, but the
maximum depth of nesting of all kinds of parenthesized sub-
pattern, including capturing subpatterns, assertions, and
other types of subpattern, is 200.
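
A caller that wants to reject over-nested patterns before compiling them could estimate the nesting depth with a scan like the one below. This is a rough sketch under stated assumptions, not how PCRE itself counts: the names max_paren_depth, nesting_within_limit, and ASSUMED_MAX_NESTING are hypothetical, and the scan only skips backslash-escaped characters and simple [...] classes. It does not handle a "]" placed first in a class, POSIX class names such as [:digit:], or comments inside the pattern.

```c
#include <assert.h>
#include <stddef.h>

#define ASSUMED_MAX_NESTING 200   /* the nesting limit quoted in the text */

/* Hypothetical helper: approximate the deepest level of parenthesis nesting
   in a pattern string. */
int max_paren_depth(const char *pattern)
{
    const char *p;
    int depth = 0, max_depth = 0, in_class = 0;

    assert(pattern != NULL);

    for (p = pattern; *p != '\0'; p++) {
        if (*p == '\\') {               /* skip the escaped character */
            if (p[1] != '\0') p++;
            continue;
        }
        if (in_class) {                 /* parentheses are literal in a class */
            if (*p == ']') in_class = 0;
            continue;
        }
        if (*p == '[') {
            in_class = 1;
        } else if (*p == '(') {
            depth++;
            if (depth > max_depth) max_depth = depth;
        } else if (*p == ')' && depth > 0) {
            depth--;
        }
    }
    return max_depth;
}

/* Returns 1 if the estimated depth does not exceed the assumed limit. */
int nesting_within_limit(const char *pattern)
{
    return max_paren_depth(pattern) <= ASSUMED_MAX_NESTING;
}
```

For the pattern the ((red|white) (king|queen)) the estimated depth is 2, well inside the limit.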
The maximum length of a subject string is the largest posi-
tive number that an integer variable can hold. However, PCRE
@ -901,6 +949,7 @@ BACKSLASH
The backslash character has several uses. Firstly, if it is
followed by a non-alphameric character, it takes away any
special meaning that character may have. This use of
backslash as an escape character applies both inside and
outside character classes.
@ -1061,7 +1110,6 @@ CIRCUMFLEX AND DOLLAR
Outside a character class, in the default matching mode, the
circumflex character is an assertion which is true only if
the current matching point is at the start of the subject
string. If the startoffset argument of pcre_exec() is non-
zero, circumflex can never match. Inside a character class,
circumflex has an entirely different meaning (see below).
@ -1105,7 +1153,7 @@ CIRCUMFLEX AND DOLLAR
Note that the sequences \A, \Z, and \z can be used to match
the start and end of the subject in both modes, and if all
branches of a pattern start with \A is it always anchored,
branches of a pattern start with \A it is always anchored,
whether PCRE_MULTILINE is set or not.
@ -1114,7 +1162,6 @@ FULL STOP (PERIOD, DOT)
Outside a character class, a dot in the pattern matches any
one character in the subject, including a non-printing char-
acter, but not (by default) newline. If the PCRE_DOTALL
option is set, dots match newlines as well. The handling of
dot is entirely independent of the handling of circumflex
and dollar, the only relationship being that they both
@ -1233,7 +1280,7 @@ POSIX CHARACTER CLASSES
[12[:^digit:]]
matches "1", "2", or any non-digit. PCRE (and Perl) also
recogize the POSIX syntax [.ch.] and [=ch=] where "ch" is a
recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a
"collating element", but these are not supported, and an
error is given if they are encountered.
@ -1352,7 +1399,7 @@ SUBPATTERNS
the ((red|white) (king|queen))
the captured substrings are "red king", "red", and "king",
and are numbered 1, 2, and 3.
and are numbered 1, 2, and 3, respectively.
The fact that plain parentheses fulfil two functions is not
always helpful. There are often times when a grouping sub-
@ -1423,7 +1470,6 @@ REPETITION
one that does not match the syntax of a quantifier, is taken
as a literal character. For example, {,6} is not a quantif-
ier, but a literal string of four characters.
The quantifier {0} is permitted, causing the expression to
behave as if the previous item and the quantifier were not
present.
@ -1528,6 +1574,14 @@ REPETITION
BACK REFERENCES
Outside a character class, a backslash followed by a digit
greater than 0 (and possibly further digits) is a back
SunOS 5.8 Last change: 30
reference to a capturing subpattern earlier (i.e. to its
left) in the pattern, provided there have been that many
previous capturing left parentheses.
@ -1583,12 +1637,11 @@ BACK REFERENCES
matches any number of "a"s and also "aba", "ababbaa" etc. At
each iteration of the subpattern, the back reference matches
the character string corresponding to the previous
iteration. In order for this to work, the pattern must be
such that the first iteration does not need to match the
back reference. This can be done using alternation, as in
the example above, or by a quantifier with a minimum of
zero.
the character string corresponding to the previous itera-
tion. In order for this to work, the pattern must be such
that the first iteration does not need to match the back
reference. This can be done using alternation, as in the
example above, or by a quantifier with a minimum of zero.
@ -1741,9 +1794,9 @@ ONCE-ONLY SUBPATTERNS
This kind of parenthesis "locks up" the part of the pattern
it contains once it has matched, and a failure further into
the pattern is prevented from backtracking into it.
Backtracking past it to previous items, however, works as
normal.
the pattern is prevented from backtracking into it. Back-
tracking past it to previous items, however, works as nor-
mal.
An alternative description is that a subpattern of this type
matches the string of characters that an identical stan-
@ -2051,8 +2104,8 @@ UTF-8 SUPPORT
Running with PCRE_UTF8 set causes these changes in the way
PCRE works:
1. In a pattern, the escape sequence \x{...}, where the con-
tents of the braces is a string of hexadecimal digits, is
1. In a pattern, the escape sequence \x{...}, where the
contents of the braces is a string of hexadecimal digits, is
interpreted as a UTF-8 character whose code number is the
given hexadecimal number, for example: \x{1234}. This
inserts from one to six literal bytes into the pattern,
@ -2106,6 +2159,7 @@ UTF-8 SUPPORT
The following UTF-8 features of Perl 5.6 are not imple-
mented:
1. The escape sequence \C to match a single byte.
2. The use of Unicode tables and properties and escapes \p,
@ -2113,6 +2167,143 @@ UTF-8 SUPPORT
SAMPLE PROGRAM
The code below is a simple, complete demonstration program,
to get you started with using PCRE. This code is also sup-
plied in the file pcredemo.c in the PCRE distribution.
The program compiles the regular expression that is its
first argument, and matches it against the subject string in
its second argument. No options are set, and default charac-
ter tables are used. If matching succeeds, the program out-
puts the portion of the subject that matched, together with
the contents of any captured substrings.
On a Unix system that has PCRE installed in /usr/local, you
can compile the demonstration program using a command like
this:
gcc -o pcredemo pcredemo.c -I/usr/local/include
-L/usr/local/lib -lpcre
Then you can run simple tests like this:
./pcredemo 'cat|dog' 'the cat sat on the mat'
Note that there is a much more comprehensive test program,
called pcretest, which supports many more facilities for
testing regular expressions. The pcredemo program is pro-
vided as a simple coding example.
On some operating systems (e.g. Solaris) you may get an
error like this when you try to run pcredemo:
ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such
file or directory
This is caused by the way shared library support works on
those systems. You need to add
-R/usr/local/lib
to the compile command to get round this problem. Here's the
code:
#include <stdio.h>
#include <string.h>
#include <pcre.h>
#define OVECCOUNT 30 /* should be a multiple of 3 */
int main(int argc, char **argv)
{
pcre *re;
const char *error;
int erroffset;
int ovector[OVECCOUNT];
int rc, i;
if (argc != 3)
{
printf("Two arguments required: a regex and a "
"subject string\n");
return 1;
}
/* Compile the regular expression in the first argument */
re = pcre_compile(
argv[1], /* the pattern */
0, /* default options */
&error, /* for error message */
&erroffset, /* for error offset */
NULL); /* use default character tables */
/* Compilation failed: print the error message and exit */
if (re == NULL)
{
printf("PCRE compilation failed at offset %d: %s\n",
erroffset, error);
return 1;
}
/* Compilation succeeded: match the subject in the second
argument */
rc = pcre_exec(
re, /* the compiled pattern */
NULL, /* we didn't study the pattern */
argv[2], /* the subject string */
(int)strlen(argv[2]), /* the length of the subject */
0, /* start at offset 0 in the subject */
0, /* default options */
ovector, /* vector for substring information */
OVECCOUNT); /* number of elements in the vector */
/* Matching failed: handle error cases */
if (rc < 0)
{
switch(rc)
{
case PCRE_ERROR_NOMATCH: printf("No match\n"); break;
/*
Handle other special cases if you like
*/
default: printf("Matching error %d\n", rc); break;
}
return 1;
}
/* Match succeeded */
printf("Match succeeded\n");
/* The output vector wasn't big enough */
if (rc == 0)
{
rc = OVECCOUNT/3;
printf("ovector only has room for %d captured "
"substrings\n", rc - 1);
}
/* Show substrings stored in the output vector */
for (i = 0; i < rc; i++)
{
char *substring_start = argv[2] + ovector[2*i];
int substring_length = ovector[2*i+1] - ovector[2*i];
printf("%2d: %.*s\n", i, substring_length,
substring_start);
}
return 0;
}
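The final loop of the demonstration program relies on the layout of the output vector: element 2*i holds the start offset of substring i and element 2*i+1 holds the offset just past its end. As a minimal sketch of the same idea, the helper below copies one captured substring into a caller-supplied buffer. The name copy_capture is hypothetical and is not part of the PCRE API; the sketch assumes that unset substrings have both offsets set to -1.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not a PCRE function): copy captured substring
   number n out of subject, using the offset pairs that pcre_exec()
   left in ovector. Returns the substring length, or -1 if the
   substring is unset or does not fit in buf. */
int copy_capture(const char *subject, const int *ovector, int n,
                 char *buf, int bufsize)
{
    int start, end, len;

    assert(subject != NULL && ovector != NULL && buf != NULL);

    start = ovector[2*n];       /* offset of the first character */
    end   = ovector[2*n + 1];   /* offset just past the last character */

    if (start < 0 || end < start)
        return -1;              /* assumed: unset substrings are -1, -1 */

    len = end - start;
    if (len + 1 > bufsize)
        return -1;              /* no room for the text plus the '\0' */

    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

Called with the subject 'the cat sat on the mat' and an output vector whose first pair is 4, 7, the helper fills the buffer with "cat".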
AUTHOR
Philip Hazel <ph10@cam.ac.uk>
University Computing Service,
@ -2120,6 +2311,5 @@ AUTHOR
Cambridge CB2 3QG, England.
Phone: +44 1223 334714
Last updated: 28 August 2000,
the 250th anniversary of the death of J.S. Bach.
Copyright (c) 1997-2000 University of Cambridge.
Last updated: 15 August 2001
Copyright (c) 1997-2001 University of Cambridge.

View File

@ -2,7 +2,7 @@
.SH NAME
pcregrep - a grep with Perl-compatible regular expressions.
.SH SYNOPSIS
.B pcregrep [-Vchilnsvx] pattern [file] ...
.B pcregrep [-Vcfhilnrsvx] pattern [file] ...
.SH DESCRIPTION
@ -32,6 +32,12 @@ Do not print individual lines; instead just print a count of the number of
lines that would otherwise have been printed. If several files are given, a
count is printed for each of them.
.TP
\fB-f\fIfilename\fR
Read patterns from the file, one per line, and match all patterns against each
line. There is a maximum of 100 patterns. Trailing white space is removed, and
blank lines are ignored. An empty file contains no patterns and therefore
matches nothing.
.TP
\fB-h\fR
Suppress printing of filenames when searching multiple files.
.TP
@ -46,6 +52,10 @@ once, on a separate line.
\fB-n\fR
Precede each line by its line number in the file.
.TP
\fB-r\fR
If any file is a directory, recursively scan the files it contains. Without
\fB-r\fR a directory is scanned as a normal file.
.TP
\fB-s\fR
Work silently, that is, display nothing except error messages.
The exit status indicates whether any matches were found.
@ -72,5 +82,7 @@ for syntax errors or inacessible files (even if matches were found).
.SH AUTHOR
Philip Hazel <ph10@cam.ac.uk>
Last updated: 15 August 2001
.br
Copyright (c) 1997-2000 University of Cambridge.
Copyright (c) 1997-2001 University of Cambridge.

View File

@ -22,7 +22,7 @@ pcregrep - a grep with Perl-compatible regular expressions.
</P>
<LI><A NAME="SEC2" HREF="#TOC1">SYNOPSIS</A>
<P>
<B>pcregrep [-Vchilnsvx] pattern [file] ...</B>
<B>pcregrep [-Vcfhilnrsvx] pattern [file] ...</B>
</P>
<LI><A NAME="SEC3" HREF="#TOC1">DESCRIPTION</A>
<P>
@ -55,6 +55,13 @@ lines that would otherwise have been printed. If several files are given, a
count is printed for each of them.
</P>
<P>
<B>-f</B><I>filename</I>
Read patterns from the file, one per line, and match all patterns against each
line. There is a maximum of 100 patterns. Trailing white space is removed, and
blank lines are ignored. An empty file contains no patterns and therefore
matches nothing.
</P>
<P>
<B>-h</B>
Suppress printing of filenames when searching multiple files.
</P>
@ -73,6 +80,11 @@ once, on a separate line.
Precede each line by its line number in the file.
</P>
<P>
<B>-r</B>
If any file is a directory, recursively scan the files it contains. Without
<B>-r</B> a directory is scanned as a normal file.
</P>
<P>
<B>-s</B>
Work silently, that is, display nothing except error messages.
The exit status indicates whether any matches were found.
@ -101,5 +113,8 @@ for syntax errors or inacessible files (even if matches were found).
<LI><A NAME="SEC7" HREF="#TOC1">AUTHOR</A>
<P>
Philip Hazel &#60;ph10@cam.ac.uk&#62;
</P>
<P>
Last updated: 15 August 2001
<BR>
Copyright (c) 1997-2000 University of Cambridge.
Copyright (c) 1997-2001 University of Cambridge.

View File

@ -4,7 +4,7 @@ NAME
SYNOPSIS
pcregrep [-Vchilnsvx] pattern [file] ...
pcregrep [-Vcfhilnrsvx] pattern [file] ...
@ -37,6 +37,14 @@ OPTIONS
wise have been printed. If several files are
given, a count is printed for each of them.
-ffilename
Read patterns from the file, one per line, and
match all patterns against each line. There is a
maximum of 100 patterns. Trailing white space is
removed, and blank lines are ignored. An empty
file contains no patterns and therefore matches
nothing.
-h Suppress printing of filenames when searching mul-
tiple files.
@ -44,12 +52,17 @@ OPTIONS
parisons.
-l Instead of printing lines from the files, just
print the names of the files containing lines that
would have been printed. Each file name is printed
once, on a separate line.
-n Precede each line by its line number in the file.
-r If any file is a directory, recursively scan the
files it contains. Without -r a directory is
scanned as a normal file.
-s Work silently, that is, display nothing except
error messages. The exit status indicates whether
any matches were found.
@ -83,5 +96,6 @@ DIAGNOSTICS
AUTHOR
Philip Hazel <ph10@cam.ac.uk>
Copyright (c) 1997-2000 University of Cambridge.
Last updated: 15 August 2001
Copyright (c) 1997-2001 University of Cambridge.

View File

@ -0,0 +1,282 @@
.TH PCRETEST 1
.SH NAME
pcretest - a program for testing Perl-compatible regular expressions.
.SH SYNOPSIS
.B pcretest "[-d] [-i] [-m] [-o osize] [-p] [-t] [source] [destination]"
\fBpcretest\fR was written as a test program for the PCRE regular expression
library itself, but it can also be used for experimenting with regular
expressions. This man page describes the features of the test program; for
details of the regular expressions themselves, see the \fBpcre\fR man page.
.SH OPTIONS
.TP 10
\fB-d\fR
Behave as if each regex had the \fB/D\fR modifier (see below); the internal
form is output after compilation.
.TP 10
\fB-i\fR
Behave as if each regex had the \fB/I\fR modifier; information about the
compiled pattern is given after compilation.
.TP 10
\fB-m\fR
Output the size of each compiled pattern after it has been compiled. This is
equivalent to adding /M to each regular expression. For compatibility with
earlier versions of pcretest, \fB-s\fR is a synonym for \fB-m\fR.
.TP 10
\fB-o\fR \fIosize\fR
Set the number of elements in the output vector that is used when calling PCRE
to be \fIosize\fR. The default value is 45, which is enough for 14 capturing
subexpressions. The vector size can be changed for individual matching calls by
including \\O in the data line (see below).
.TP 10
\fB-p\fR
Behave as if each regex had the \fB/P\fR modifier; the POSIX wrapper API is used
to call PCRE. None of the other options has any effect when \fB-p\fR is set.
.TP 10
\fB-t\fR
Run each compile, study, and match 20000 times with a timer, and output
resulting time per compile or match (in milliseconds). Do not set \fB-t\fR with
\fB-m\fR, because you will then get the size output 20000 times and the timing
will be distorted.
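The relationship between the output vector size and the number of capturing subexpressions (45 elements for 14 subexpressions) follows from how the vector is used: only the first two-thirds hold offset pairs, and one pair is taken by the match of the whole pattern. A minimal sketch of that arithmetic follows; max_captures is a hypothetical name chosen for illustration, not part of pcretest or PCRE.

```c
#include <assert.h>

/* Hypothetical helper: number of capturing subexpressions whose
   offsets fit in an output vector of osize ints. One third of the
   vector is workspace; the rest holds pairs of offsets, and the
   first pair describes the whole match. */
int max_captures(int osize)
{
    int pairs;

    assert(osize >= 0);

    pairs = osize / 3;                  /* pairs available for offsets */
    return pairs > 0 ? pairs - 1 : 0;   /* minus the whole-match pair */
}
```

With the default of 45 this gives 45/3 - 1 = 14, and with the OVECCOUNT of 30 used in pcredemo.c it gives 9.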
.SH DESCRIPTION
If \fBpcretest\fR is given two filename arguments, it reads from the first and
writes to the second. If it is given only one filename argument, it reads from
that file and writes to stdout. Otherwise, it reads from stdin and writes to
stdout, and prompts for each line of input, using "re>" to prompt for regular
expressions, and "data>" to prompt for data lines.
The program handles any number of sets of input on a single input file. Each
set starts with a regular expression, and continues with any number of data
lines to be matched against the pattern. An empty line signals the end of the
data lines, at which point a new regular expression is read. The regular
expressions are given enclosed in any non-alphameric delimiters other than
backslash, for example
/(a|bc)x+yz/
White space before the initial delimiter is ignored. A regular expression may
be continued over several input lines, in which case the newline characters are
included within it. It is possible to include the delimiter within the pattern
by escaping it, for example
/abc\\/def/
If you do so, the escape and the delimiter form part of the pattern, but since
delimiters are always non-alphameric, this does not affect its interpretation.
If the terminating delimiter is immediately followed by a backslash, for
example,
/abc/\\
then a backslash is added to the end of the pattern. This is done to provide a
way of testing the error condition that arises if a pattern finishes with a
backslash, because
/abc\\/
is interpreted as the first line of a pattern that starts with "abc/", causing
pcretest to read the next line as a continuation of the regular expression.
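The delimiter rules above can be sketched as a small scanner. This is an illustration under stated assumptions, not pcretest's own code: find_closing_delimiter is a hypothetical name, and the caller is assumed to have already skipped any white space before the opening delimiter.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: line starts with the opening delimiter.
   Returns a pointer to the closing delimiter, or NULL if the line
   ends first, in which case the pattern continues on the next line. */
const char *find_closing_delimiter(const char *line)
{
    char delim;
    const char *p;

    assert(line != NULL);

    delim = line[0];
    /* Delimiters are non-alphameric and may not be a backslash. */
    if (delim == '\0' || delim == '\\' || isalnum((unsigned char)delim))
        return NULL;

    for (p = line + 1; *p != '\0'; p++) {
        if (*p == '\\' && p[1] != '\0')
            p++;                /* an escaped character stays in the pattern */
        else if (*p == delim)
            return p;           /* first unescaped delimiter ends it */
    }
    return NULL;
}
```

For /abc\\/def/ the scanner stops at the final slash, for /abc/\\ it stops at the second slash and leaves the trailing backslash for the caller to handle, and for /abc\\/ it returns NULL because the only later slash is escaped.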
.SH PATTERN MODIFIERS
The pattern may be followed by \fBi\fR, \fBm\fR, \fBs\fR, or \fBx\fR to set the
PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options,
respectively. For example:
/caseless/i
These modifier letters have the same effect as they do in Perl. There are
others which set PCRE options that do not correspond to anything in Perl:
\fB/A\fR, \fB/E\fR, and \fB/X\fR set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and
PCRE_EXTRA respectively.
Searching for all possible matches within each subject string can be requested
by the \fB/g\fR or \fB/G\fR modifier. After finding a match, PCRE is called
again to search the remainder of the subject string. The difference between
\fB/g\fR and \fB/G\fR is that the former uses the \fIstartoffset\fR argument to
\fBpcre_exec()\fR to start searching at a new point within the entire string
(which is in effect what Perl does), whereas the latter passes over a shortened
substring. This makes a difference to the matching process if the pattern
begins with a lookbehind assertion (including \\b or \\B).
If any call to \fBpcre_exec()\fR in a \fB/g\fR or \fB/G\fR sequence matches an
empty string, the next call is done with the PCRE_NOTEMPTY and PCRE_ANCHORED
flags set in order to search for another, non-empty, match at the same point.
If this second match fails, the start offset is advanced by one, and the normal
match is retried. This imitates the way Perl handles such cases when using the
\fB/g\fR modifier or the \fBsplit()\fR function.
There are a number of other modifiers for controlling the way \fBpcretest\fR
operates.
The \fB/+\fR modifier requests that as well as outputting the substring that
matched the entire pattern, pcretest should in addition output the remainder of
the subject string. This is useful for tests where the subject contains
multiple copies of the same substring.
The \fB/L\fR modifier must be followed directly by the name of a locale, for
example,
/pattern/Lfr
For this reason, it must be the last modifier letter. The given locale is set,
\fBpcre_maketables()\fR is called to build a set of character tables for the
locale, and this is then passed to \fBpcre_compile()\fR when compiling the
regular expression. Without an \fB/L\fR modifier, NULL is passed as the tables
pointer; that is, \fB/L\fR applies only to the expression on which it appears.
The \fB/I\fR modifier requests that \fBpcretest\fR output information about the
compiled expression (whether it is anchored, has a fixed first character, and
so on). It does this by calling \fBpcre_fullinfo()\fR after compiling an
expression, and outputting the information it gets back. If the pattern is
studied, the results of that are also output.
The \fB/D\fR modifier is a PCRE debugging feature, which also assumes \fB/I\fR.
It causes the internal form of compiled regular expressions to be output after
compilation.
The \fB/S\fR modifier causes \fBpcre_study()\fR to be called after the
expression has been compiled, and the results used when the expression is
matched.
The \fB/M\fR modifier causes the size of the memory block used to hold the compiled
pattern to be output.
The \fB/P\fR modifier causes \fBpcretest\fR to call PCRE via the POSIX wrapper
API rather than its native API. When this is done, all other modifiers except
\fB/i\fR, \fB/m\fR, and \fB/+\fR are ignored. REG_ICASE is set if \fB/i\fR is
present, and REG_NEWLINE is set if \fB/m\fR is present. The wrapper functions
force PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless REG_NEWLINE is set.
The \fB/8\fR modifier causes \fBpcretest\fR to call PCRE with the PCRE_UTF8
option set. This turns on the (currently incomplete) support for UTF-8
character handling in PCRE, provided that it was compiled with this support
enabled. This modifier also causes any non-printing characters in output
strings to be printed using the \\x{hh...} notation if they are valid UTF-8
sequences.
.SH DATA LINES
Before each data line is passed to \fBpcre_exec()\fR, leading and trailing
whitespace is removed, and it is then scanned for \\ escapes. The following are
recognized:
\\a alarm (= BEL)
\\b backspace
\\e escape
\\f formfeed
\\n newline
\\r carriage return
\\t tab
\\v vertical tab
\\nnn octal character (up to 3 octal digits)
\\xhh hexadecimal character (up to 2 hex digits)
\\x{hh...} hexadecimal UTF-8 character
\\A pass the PCRE_ANCHORED option to \fBpcre_exec()\fR
\\B pass the PCRE_NOTBOL option to \fBpcre_exec()\fR
\\Cdd call pcre_copy_substring() for substring dd
after a successful match (any decimal number
less than 32)
\\Gdd call pcre_get_substring() for substring dd
after a successful match (any decimal number
less than 32)
\\L call pcre_get_substring_list() after a
successful match
\\N pass the PCRE_NOTEMPTY option to \fBpcre_exec()\fR
\\Odd set the size of the output vector passed to
\fBpcre_exec()\fR to dd (any number of decimal
digits)
\\Z pass the PCRE_NOTEOL option to \fBpcre_exec()\fR
When \\O is used, the value may be higher or lower than the size set by the \fB-o\fR
option (or defaulted to 45); \\O applies only to the call of \fBpcre_exec()\fR
for the line in which it appears.
A backslash followed by anything else just escapes the anything else. If the
very last character is a backslash, it is ignored. This gives a way of passing
an empty line as data, since a real empty line terminates the data input.
If \fB/P\fR was present on the regex, causing the POSIX wrapper API to be used,
only \fB\\B\fR and \fB\\Z\fR have any effect, causing REG_NOTBOL and REG_NOTEOL
to be passed to \fBregexec()\fR respectively.
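As a rough illustration of what REG_NOTBOL and REG_NOTEOL do, the sketch below uses the regcomp() and regexec() calls from the system's POSIX <regex.h>. PCRE's wrapper declares functions of the same names in pcreposix.h, but building against it instead is an assumption that has not been checked here, and details of behaviour may differ between the two. The name posix_match is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if pattern matches subject, 0 if it
   does not, and -1 if the pattern fails to compile. cflags are passed
   to regcomp() and eflags to regexec(). */
int posix_match(const char *pattern, const char *subject,
                int cflags, int eflags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && subject != NULL);

    if (regcomp(&re, pattern, cflags) != 0)
        return -1;

    /* No substring offsets are requested, so nmatch is 0. */
    rc = regexec(&re, subject, 0, NULL, eflags);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With eflags of 0 the pattern ^abc matches the subject abc; with REG_NOTBOL it does not, because the start of the subject is no longer treated as the beginning of a line. REG_NOTEOL has the corresponding effect on abc$.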
The use of \\x{hh...} to represent UTF-8 characters is not dependent on the use
of the \fB/8\fR modifier on the pattern. It is recognized always. There may be
any number of hexadecimal digits inside the braces. The result is from one to
six bytes, encoded according to the UTF-8 rules.
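The "one to six bytes" result comes from the original UTF-8 scheme, in which values up to 0x7fffffff are allowed. A minimal sketch of that encoding follows; encode_utf8 is a hypothetical name and this is not the code pcretest itself uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode cp (0 to 0x7fffffff) with the original
   UTF-8 rules, which allow up to six bytes. out must have room for
   six bytes. Returns the number of bytes written. */
int encode_utf8(unsigned long cp, unsigned char *out)
{
    /* Largest value that fits in 1, 2, 3, 4 and 5 bytes. */
    static const unsigned long limits[] =
        { 0x7fUL, 0x7ffUL, 0xffffUL, 0x1fffffUL, 0x3ffffffUL };
    /* Marker bits for the first byte, indexed by extra byte count. */
    static const unsigned char lead[] =
        { 0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc };
    int extra, i;

    assert(out != NULL && cp <= 0x7fffffffUL);

    /* Count how many continuation bytes are needed (0 to 5). */
    for (extra = 0; extra < 5; extra++)
        if (cp <= limits[extra])
            break;

    /* Fill continuation bytes from the end, six bits at a time. */
    for (i = extra; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3f));
        cp >>= 6;
    }
    out[0] = (unsigned char)(lead[extra] | cp);
    return extra + 1;
}
```

For example, the value written as \\x{e9} in a data line would become the two bytes 0xc3 0xa9 under this scheme, and \\x{20ac} the three bytes 0xe2 0x82 0xac.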
.SH OUTPUT FROM PCRETEST
When a match succeeds, pcretest outputs the list of captured substrings that
\fBpcre_exec()\fR returns, starting with number 0 for the string that matched
the whole pattern. Here is an example of an interactive pcretest run.
$ pcretest
PCRE version 2.06 08-Jun-1999
re> /^abc(\\d+)/
data> abc123
0: abc123
1: 123
data> xyz
No match
If the strings contain any non-printing characters, they are output as \\0x
escapes, or as \\x{...} escapes if the \fB/8\fR modifier was present on the
pattern. If the pattern has the \fB/+\fR modifier, then the output for
substring 0 is followed by the rest of the subject string, identified by
"0+" like this:
re> /cat/+
data> cataract
0: cat
0+ aract
If the pattern has the \fB/g\fR or \fB/G\fR modifier, the results of successive
matching attempts are output in sequence, like this:
re> /\\Bi(\\w\\w)/g
data> Mississippi
0: iss
1: ss
0: iss
1: ss
0: ipp
1: pp
"No match" is output only if the first match attempt fails.
If any of the sequences \fB\\C\fR, \fB\\G\fR, or \fB\\L\fR are present in a
data line that is successfully matched, the substrings extracted by the
convenience functions are output with C, G, or L after the string number
instead of a colon. This is in addition to the normal full list. The string
length (that is, the return from the extraction function) is given in
parentheses after each string for \fB\\C\fR and \fB\\G\fR.
Note that while patterns can be continued over several lines (a plain ">"
prompt is used for continuations), data lines may not. However, newlines can be
included in data by means of the \\n escape.
.SH AUTHOR
Philip Hazel <ph10@cam.ac.uk>
.br
University Computing Service,
.br
New Museums Site,
.br
Cambridge CB2 3QG, England.
.br
Phone: +44 1223 334714
Last updated: 15 August 2001
.br
Copyright (c) 1997-2001 University of Cambridge.

View File

@ -0,0 +1,369 @@
<HTML>
<HEAD>
<TITLE>pcretest specification</TITLE>
</HEAD>
<body bgcolor="#FFFFFF" text="#00005A">
<H1>pcretest specification</H1>
This HTML document has been generated automatically from the original man page.
If there is any nonsense in it, please consult the man page in case the
conversion went wrong.
<UL>
<LI><A NAME="TOC1" HREF="#SEC1">NAME</A>
<LI><A NAME="TOC2" HREF="#SEC2">SYNOPSIS</A>
<LI><A NAME="TOC3" HREF="#SEC3">OPTIONS</A>
<LI><A NAME="TOC4" HREF="#SEC4">DESCRIPTION</A>
<LI><A NAME="TOC5" HREF="#SEC5">PATTERN MODIFIERS</A>
<LI><A NAME="TOC6" HREF="#SEC6">DATA LINES</A>
<LI><A NAME="TOC7" HREF="#SEC7">OUTPUT FROM PCRETEST</A>
<LI><A NAME="TOC8" HREF="#SEC8">AUTHOR</A>
</UL>
<LI><A NAME="SEC1" HREF="#TOC1">NAME</A>
<P>
pcretest - a program for testing Perl-compatible regular expressions.
</P>
<LI><A NAME="SEC2" HREF="#TOC1">SYNOPSIS</A>
<P>
<B>pcretest [-d] [-i] [-m] [-o osize] [-p] [-t] [source] [destination]</B>
</P>
<P>
<B>pcretest</B> was written as a test program for the PCRE regular expression
library itself, but it can also be used for experimenting with regular
expressions. This man page describes the features of the test program; for
details of the regular expressions themselves, see the <B>pcre</B> man page.
</P>
<LI><A NAME="SEC3" HREF="#TOC1">OPTIONS</A>
<P>
<B>-d</B>
Behave as if each regex had the <B>/D</B> modifier (see below); the internal
form is output after compilation.
</P>
<P>
<B>-i</B>
Behave as if each regex had the <B>/I</B> modifier; information about the
compiled pattern is given after compilation.
</P>
<P>
<B>-m</B>
Output the size of each compiled pattern after it has been compiled. This is
equivalent to adding /M to each regular expression. For compatibility with
earlier versions of pcretest, <B>-s</B> is a synonym for <B>-m</B>.
</P>
<P>
<B>-o</B> <I>osize</I>
Set the number of elements in the output vector that is used when calling PCRE
to be <I>osize</I>. The default value is 45, which is enough for 14 capturing
subexpressions. The vector size can be changed for individual matching calls by
including \O in the data line (see below).
</P>
<P>
<B>-p</B>
Behave as if each regex had the <B>/P</B> modifier; the POSIX wrapper API is used
to call PCRE. None of the other options has any effect when <B>-p</B> is set.
</P>
<P>
<B>-t</B>
Run each compile, study, and match 20000 times with a timer, and output
resulting time per compile or match (in milliseconds). Do not set <B>-t</B> with
<B>-m</B>, because you will then get the size output 20000 times and the timing
will be distorted.
</P>
<LI><A NAME="SEC4" HREF="#TOC1">DESCRIPTION</A>
<P>
If <B>pcretest</B> is given two filename arguments, it reads from the first and
writes to the second. If it is given only one filename argument, it reads from
that file and writes to stdout. Otherwise, it reads from stdin and writes to
stdout, and prompts for each line of input, using "re&#62;" to prompt for regular
expressions, and "data&#62;" to prompt for data lines.
</P>
<P>
The program handles any number of sets of input on a single input file. Each
set starts with a regular expression, and continues with any number of data
lines to be matched against the pattern. An empty line signals the end of the
data lines, at which point a new regular expression is read. The regular
expressions are given enclosed in any non-alphameric delimiters other than
backslash, for example
</P>
<P>
<PRE>
/(a|bc)x+yz/
</PRE>
</P>
<P>
White space before the initial delimiter is ignored. A regular expression may
be continued over several input lines, in which case the newline characters are
included within it. It is possible to include the delimiter within the pattern
by escaping it, for example
</P>
<P>
<PRE>
/abc\/def/
</PRE>
</P>
<P>
If you do so, the escape and the delimiter form part of the pattern, but since
delimiters are always non-alphameric, this does not affect its interpretation.
If the terminating delimiter is immediately followed by a backslash, for
example,
</P>
<P>
<PRE>
/abc/\
</PRE>
</P>
<P>
then a backslash is added to the end of the pattern. This is done to provide a
way of testing the error condition that arises if a pattern finishes with a
backslash, because
</P>
<P>
<PRE>
/abc\/
</PRE>
</P>
<P>
is interpreted as the first line of a pattern that starts with "abc/", causing
pcretest to read the next line as a continuation of the regular expression.
</P>
<LI><A NAME="SEC5" HREF="#TOC1">PATTERN MODIFIERS</A>
<P>
The pattern may be followed by <B>i</B>, <B>m</B>, <B>s</B>, or <B>x</B> to set the
PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options,
respectively. For example:
</P>
<P>
<PRE>
/caseless/i
</PRE>
</P>
<P>
These modifier letters have the same effect as they do in Perl. There are
others which set PCRE options that do not correspond to anything in Perl:
<B>/A</B>, <B>/E</B>, and <B>/X</B> set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and
PCRE_EXTRA respectively.
</P>
<P>
Searching for all possible matches within each subject string can be requested
by the <B>/g</B> or <B>/G</B> modifier. After finding a match, PCRE is called
again to search the remainder of the subject string. The difference between
<B>/g</B> and <B>/G</B> is that the former uses the <I>startoffset</I> argument to
<B>pcre_exec()</B> to start searching at a new point within the entire string
(which is in effect what Perl does), whereas the latter passes over a shortened
substring. This makes a difference to the matching process if the pattern
begins with a lookbehind assertion (including \b or \B).
</P>
<P>
If any call to <B>pcre_exec()</B> in a <B>/g</B> or <B>/G</B> sequence matches an
empty string, the next call is done with the PCRE_NOTEMPTY and PCRE_ANCHORED
flags set in order to search for another, non-empty, match at the same point.
If this second match fails, the start offset is advanced by one, and the normal
match is retried. This imitates the way Perl handles such cases when using the
<B>/g</B> modifier or the <B>split()</B> function.
</P>
<P>
There are a number of other modifiers for controlling the way <B>pcretest</B>
operates.
</P>
<P>
The <B>/+</B> modifier requests that as well as outputting the substring that
matched the entire pattern, pcretest should in addition output the remainder of
the subject string. This is useful for tests where the subject contains
multiple copies of the same substring.
</P>
<P>
The <B>/L</B> modifier must be followed directly by the name of a locale, for
example,
</P>
<P>
<PRE>
/pattern/Lfr
</PRE>
</P>
<P>
For this reason, it must be the last modifier letter. The given locale is set,
<B>pcre_maketables()</B> is called to build a set of character tables for the
locale, and this is then passed to <B>pcre_compile()</B> when compiling the
regular expression. Without an <B>/L</B> modifier, NULL is passed as the tables
pointer; that is, <B>/L</B> applies only to the expression on which it appears.
</P>
<P>
The <B>/I</B> modifier requests that <B>pcretest</B> output information about the
compiled expression (whether it is anchored, has a fixed first character, and
so on). It does this by calling <B>pcre_fullinfo()</B> after compiling an
expression, and outputting the information it gets back. If the pattern is
studied, the results of that are also output.
</P>
<P>
The <B>/D</B> modifier is a PCRE debugging feature, which also assumes <B>/I</B>.
It causes the internal form of compiled regular expressions to be output after
compilation.
</P>
<P>
The <B>/S</B> modifier causes <B>pcre_study()</B> to be called after the
expression has been compiled, and the results used when the expression is
matched.
</P>
<P>
The <B>/M</B> modifier causes the size of the memory block used to hold the compiled
pattern to be output.
</P>
<P>
The <B>/P</B> modifier causes <B>pcretest</B> to call PCRE via the POSIX wrapper
API rather than its native API. When this is done, all other modifiers except
<B>/i</B>, <B>/m</B>, and <B>/+</B> are ignored. REG_ICASE is set if <B>/i</B> is
present, and REG_NEWLINE is set if <B>/m</B> is present. The wrapper functions
force PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless REG_NEWLINE is set.
</P>
<P>
The <B>/8</B> modifier causes <B>pcretest</B> to call PCRE with the PCRE_UTF8
option set. This turns on the (currently incomplete) support for UTF-8
character handling in PCRE, provided that it was compiled with this support
enabled. This modifier also causes any non-printing characters in output
strings to be printed using the \x{hh...} notation if they are valid UTF-8
sequences.
</P>
<LI><A NAME="SEC6" HREF="#TOC1">DATA LINES</A>
<P>
Before each data line is passed to <B>pcre_exec()</B>, leading and trailing
whitespace is removed, and it is then scanned for \ escapes. The following are
recognized:
</P>
<P>
<PRE>
\a alarm (= BEL)
\b backspace
\e escape
\f formfeed
\n newline
\r carriage return
\t tab
\v vertical tab
\nnn octal character (up to 3 octal digits)
\xhh hexadecimal character (up to 2 hex digits)
\x{hh...} hexadecimal UTF-8 character
</PRE>
</P>
<P>
<PRE>
\A pass the PCRE_ANCHORED option to <B>pcre_exec()</B>
\B pass the PCRE_NOTBOL option to <B>pcre_exec()</B>
\Cdd call pcre_copy_substring() for substring dd
after a successful match (any decimal number
less than 32)
\Gdd call pcre_get_substring() for substring dd
after a successful match (any decimal number
less than 32)
\L call pcre_get_substring_list() after a
successful match
\N pass the PCRE_NOTEMPTY option to <B>pcre_exec()</B>
\Odd set the size of the output vector passed to
<B>pcre_exec()</B> to dd (any number of decimal
digits)
\Z pass the PCRE_NOTEOL option to <B>pcre_exec()</B>
</PRE>
</P>
<P>
When \O is used, the value may be higher or lower than the size set by the <B>-o</B>
option (or defaulted to 45); \O applies only to the call of <B>pcre_exec()</B>
for the line in which it appears.
</P>
<P>
A backslash followed by anything else just escapes the anything else. If the
very last character is a backslash, it is ignored. This gives a way of passing
an empty line as data, since a real empty line terminates the data input.
</P>
<P>
If <B>/P</B> was present on the regex, causing the POSIX wrapper API to be used,
only <B>\B</B> and <B>\Z</B> have any effect, causing REG_NOTBOL and REG_NOTEOL
to be passed to <B>regexec()</B> respectively.
</P>
<P>
The use of \x{hh...} to represent UTF-8 characters is not dependent on the use
of the <B>/8</B> modifier on the pattern. It is recognized always. There may be
any number of hexadecimal digits inside the braces. The result is from one to
six bytes, encoded according to the UTF-8 rules.
</P>
<LI><A NAME="SEC7" HREF="#TOC1">OUTPUT FROM PCRETEST</A>
<P>
When a match succeeds, pcretest outputs the list of captured substrings that
<B>pcre_exec()</B> returns, starting with number 0 for the string that matched
the whole pattern. Here is an example of an interactive pcretest run.
</P>
<P>
<PRE>
$ pcretest
PCRE version 2.06 08-Jun-1999
</PRE>
</P>
<P>
<PRE>
re&#62; /^abc(\d+)/
data&#62; abc123
0: abc123
1: 123
data&#62; xyz
No match
</PRE>
</P>
<P>
If the strings contain any non-printing characters, they are output as \0x
escapes, or as \x{...} escapes if the <B>/8</B> modifier was present on the
pattern. If the pattern has the <B>/+</B> modifier, then the output for
substring 0 is followed by the rest of the subject string, identified by
"0+" like this:
</P>
<P>
<PRE>
re&#62; /cat/+
data&#62; cataract
0: cat
0+ aract
</PRE>
</P>
<P>
If the pattern has the <B>/g</B> or <B>/G</B> modifier, the results of successive
matching attempts are output in sequence, like this:
</P>
<P>
<PRE>
re&#62; /\Bi(\w\w)/g
data&#62; Mississippi
0: iss
1: ss
0: iss
1: ss
0: ipp
1: pp
</PRE>
</P>
<P>
"No match" is output only if the first match attempt fails.
</P>
<P>
If any of the sequences <B>\C</B>, <B>\G</B>, or <B>\L</B> are present in a
data line that is successfully matched, the substrings extracted by the
convenience functions are output with C, G, or L after the string number
instead of a colon. This is in addition to the normal full list. The string
length (that is, the return from the extraction function) is given in
parentheses after each string for <B>\C</B> and <B>\G</B>.
</P>
<P>
Note that while patterns can be continued over several lines (a plain "&#62;"
prompt is used for continuations), data lines may not. However, newlines can be
included in data by means of the \n escape.
</P>
<LI><A NAME="SEC8" HREF="#TOC1">AUTHOR</A>
<P>
Philip Hazel &#60;ph10@cam.ac.uk&#62;
<BR>
University Computing Service,
<BR>
New Museums Site,
<BR>
Cambridge CB2 3QG, England.
<BR>
Phone: +44 1223 334714
</P>
<P>
Last updated: 15 August 2001
<BR>
Copyright (c) 1997-2001 University of Cambridge.

View File

@ -1,246 +1,319 @@
The pcretest program
--------------------
NAME
pcretest - a program for testing Perl-compatible regular
expressions.
This program is intended for testing PCRE, but it can also be used for
experimenting with regular expressions.
If it is given two filename arguments, it reads from the first and writes to
the second. If it is given only one filename argument, it reads from that file
and writes to stdout. Otherwise, it reads from stdin and writes to stdout, and
prompts for each line of input, using "re>" to prompt for regular expressions,
and "data>" to prompt for data lines.
The program handles any number of sets of input on a single input file. Each
set starts with a regular expression, and continues with any number of data
lines to be matched against the pattern. An empty line signals the end of the
data lines, at which point a new regular expression is read. The regular
expressions are given enclosed in any non-alphameric delimiters other than
backslash, for example
SYNOPSIS
pcretest [-d] [-i] [-m] [-o osize] [-p] [-t] [source] [des-
tination]
/(a|bc)x+yz/
pcretest was written as a test program for the PCRE regular
expression library itself, but it can also be used for
experimenting with regular expressions. This man page
describes the features of the test program; for details of
the regular expressions themselves, see the pcre man page.
White space before the initial delimiter is ignored. A regular expression may
be continued over several input lines, in which case the newline characters are
included within it. See the test input files in the testdata directory for many
examples. It is possible to include the delimiter within the pattern by
escaping it, for example
/abc\/def/
If you do so, the escape and the delimiter form part of the pattern, but since
delimiters are always non-alphameric, this does not affect its interpretation.
If the terminating delimiter is immediately followed by a backslash, for
example,
OPTIONS
-d Behave as if each regex had the /D modifier (see
below); the internal form is output after compila-
tion.
/abc/\
-i Behave as if each regex had the /I modifier;
information about the compiled pattern is given
after compilation.
then a backslash is added to the end of the pattern. This is done to provide a
way of testing the error condition that arises if a pattern finishes with a
backslash, because
-m Output the size of each compiled pattern after it
has been compiled. This is equivalent to adding /M
to each regular expression. For compatibility with
earlier versions of pcretest, -s is a synonym for
-m.
/abc\/
-o osize Set the number of elements in the output vector
that is used when calling PCRE to be osize. The
default value is 45, which is enough for 14 cap-
turing subexpressions. The vector size can be
changed for individual matching calls by including
\O in the data line (see below).
-p Behave as if each regex had the /P modifier; the POSIX
wrapper API is used to call PCRE. None of the
other options has any effect when -p is set.
-t Run each compile, study, and match 20000 times
with a timer, and output resulting time per com-
pile or match (in milliseconds). Do not set -t
with -m, because you will then get the size output
20000 times and the timing will be distorted.
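
The arithmetic behind the default of 45 can be sketched as follows. This assumes the usual pcre_exec() convention that the output vector uses three ints per substring slot (two offsets plus one int of workspace). The function names are illustrative only and are not part of pcretest.

```c
#include <assert.h>

/* Number of substring slots (the whole match plus captures) that fit in an
   output vector of osize ints, assuming three ints are needed per slot. */
int substring_slots(int osize)
{
  assert(osize >= 0);
  return osize / 3;
}

/* Number of capturing subexpressions that can be reported: one slot is
   always taken by substring 0, the match for the whole pattern. */
int capturing_slots(int osize)
{
  int slots = substring_slots(osize);
  return (slots > 0) ? slots - 1 : 0;
}
```

With osize set to 45 this gives 15 slots, that is, substring 0 plus the 14 capturing subexpressions mentioned under -o.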
DESCRIPTION
If pcretest is given two filename arguments, it reads from
the first and writes to the second. If it is given only one
filename argument, it reads from that file and writes to
stdout. Otherwise, it reads from stdin and writes to stdout,
and prompts for each line of input, using "re>" to prompt
for regular expressions, and "data>" to prompt for data
lines.
The program handles any number of sets of input on a single
input file. Each set starts with a regular expression, and
continues with any number of data lines to be matched
against the pattern. An empty line signals the end of the
data lines, at which point a new regular expression is read.
The regular expressions are given enclosed in any non-
alphameric delimiters other than backslash, for example
/(a|bc)x+yz/
White space before the initial delimiter is ignored. A regu-
lar expression may be continued over several input lines, in
which case the newline characters are included within it. It
is possible to include the delimiter within the pattern by
escaping it, for example
/abc\/def/
If you do so, the escape and the delimiter form part of the
pattern, but since delimiters are always non-alphameric,
this does not affect its interpretation. If the terminating
delimiter is immediately followed by a backslash, for exam-
ple,
/abc/\
then a backslash is added to the end of the pattern. This is
done to provide a way of testing the error condition that
arises if a pattern finishes with a backslash, because
/abc\/
is interpreted as the first line of a pattern that starts
with "abc/", causing pcretest to read the next line as a
continuation of the regular expression.
is interpreted as the first line of a pattern that starts with "abc/", causing
pcretest to read the next line as a continuation of the regular expression.
PATTERN MODIFIERS
-----------------
The pattern may be followed by i, m, s, or x to set the
PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED
options, respectively. For example:
The pattern may be followed by i, m, s, or x to set the PCRE_CASELESS,
PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options, respectively. For
example:
/caseless/i
/caseless/i
These modifier letters have the same effect as they do in
Perl. There are others which set PCRE options that do not
correspond to anything in Perl: /A, /E, and /X set
PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA respec-
tively.
These modifier letters have the same effect as they do in Perl. There are
others which set PCRE options that do not correspond to anything in Perl: /A,
/E, and /X set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA respectively.
Searching for all possible matches within each subject
string can be requested by the /g or /G modifier. After
finding a match, PCRE is called again to search the
remainder of the subject string. The difference between /g
and /G is that the former uses the startoffset argument to
pcre_exec() to start searching at a new point within the
entire string (which is in effect what Perl does), whereas
the latter passes over a shortened substring. This makes a
difference to the matching process if the pattern begins
with a lookbehind assertion (including \b or \B).
Searching for all possible matches within each subject string can be requested
by the /g or /G modifier. After finding a match, PCRE is called again to search
the remainder of the subject string. The difference between /g and /G is that
the former uses the startoffset argument to pcre_exec() to start searching at
a new point within the entire string (which is in effect what Perl does),
whereas the latter passes over a shortened substring. This makes a difference
to the matching process if the pattern begins with a lookbehind assertion
(including \b or \B).
If any call to pcre_exec() in a /g or /G sequence matches an
empty string, the next call is done with the PCRE_NOTEMPTY
and PCRE_ANCHORED flags set in order to search for another,
non-empty, match at the same point. If this second match
fails, the start offset is advanced by one, and the normal
match is retried. This imitates the way Perl handles such
cases when using the /g modifier or the split() function.
If any call to pcre_exec() in a /g or /G sequence matches an empty string, the
next call is done with the PCRE_NOTEMPTY and PCRE_ANCHORED flags set in order
to search for another, non-empty, match at the same point. If this second match
fails, the start offset is advanced by one, and the normal match is retried.
This imitates the way Perl handles such cases when using the /g modifier or the
split() function.
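
The retry logic for empty matches can be sketched without the library. In this sketch toy_match() is a hypothetical stand-in for pcre_exec() that only knows the fixed pattern x*, and RETRY_NOTEMPTY_ANCHORED stands for PCRE_NOTEMPTY | PCRE_ANCHORED. It illustrates the loop only and is not the pcretest source.

```c
#include <assert.h>
#include <string.h>

#define RETRY_NOTEMPTY_ANCHORED 1  /* stands for PCRE_NOTEMPTY | PCRE_ANCHORED */

/* Hypothetical matcher for the fixed pattern x* (zero or more 'x').
   An unanchored search from "start" always succeeds at "start", possibly
   with an empty match; the retry flags reject an empty match. */
int toy_match(const char *s, int len, int start, int flags, int *ms, int *me)
{
  int end = start;
  while (end < len && s[end] == 'x') end++;
  if (flags == RETRY_NOTEMPTY_ANCHORED && end == start) return 0;
  *ms = start;
  *me = end;
  return 1;
}

/* Collect up to max matches as (start, end) offset pairs; returns the count. */
int find_all(const char *s, int out[][2], int max)
{
  int len = (int)strlen(s);
  int start = 0, flags = 0, n = 0;
  assert(max > 0);
  while (n < max)
    {
    int ms, me;
    if (!toy_match(s, len, start, flags, &ms, &me))
      {
      if (flags == 0) break;   /* a normal match failed: finished */
      start++;                 /* the retry failed: advance by one */
      flags = 0;               /* and go back to a normal match */
      continue;
      }
    out[n][0] = ms;
    out[n][1] = me;
    n++;
    if (ms == me)
      {
      if (ms == len) break;    /* empty match at the very end: finished */
      flags = RETRY_NOTEMPTY_ANCHORED;
      }
    else flags = 0;
    start = me;
    }
  return n;
}
```

For the subject "bxxb" the sketch reports four matches: an empty string at offset 0, "xx" at offsets 1 to 3, an empty string at offset 3, and an empty string at the end of the subject.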
There are a number of other modifiers for controlling the
way pcretest operates.
There are a number of other modifiers for controlling the way pcretest
operates.
The /+ modifier requests that as well as outputting the sub-
string that matched the entire pattern, pcretest should in
addition output the remainder of the subject string. This is
useful for tests where the subject contains multiple copies
of the same substring.
The /+ modifier requests that as well as outputting the substring that matched
the entire pattern, pcretest should in addition output the remainder of the
subject string. This is useful for tests where the subject contains multiple
copies of the same substring.
The /L modifier must be followed directly by the name of a
locale, for example,
The /L modifier must be followed directly by the name of a locale, for example,
/pattern/Lfr
/pattern/Lfr
For this reason, it must be the last modifier letter. The
given locale is set, pcre_maketables() is called to build a
set of character tables for the locale, and this is then
passed to pcre_compile() when compiling the regular expres-
sion. Without an /L modifier, NULL is passed as the tables
pointer; that is, /L applies only to the expression on which
it appears.
For this reason, it must be the last modifier letter. The given locale is set,
pcre_maketables() is called to build a set of character tables for the locale,
and this is then passed to pcre_compile() when compiling the regular
expression. Without an /L modifier, NULL is passed as the tables pointer; that
is, /L applies only to the expression on which it appears.
The /I modifier requests that pcretest output information
about the compiled expression (whether it is anchored, has a
fixed first character, and so on). It does this by calling
pcre_fullinfo() after compiling an expression, and output-
ting the information it gets back. If the pattern is stu-
died, the results of that are also output.
The /D modifier is a PCRE debugging feature, which also
assumes /I. It causes the internal form of compiled regular
expressions to be output after compilation.
The /I modifier requests that pcretest output information about the compiled
expression (whether it is anchored, has a fixed first character, and so on). It
does this by calling pcre_fullinfo() after compiling an expression, and
outputting the information it gets back. If the pattern is studied, the results
of that are also output.
The /S modifier causes pcre_study() to be called after the
expression has been compiled, and the results used when the
expression is matched.
The /D modifier is a PCRE debugging feature, which also assumes /I. It causes
the internal form of compiled regular expressions to be output after
compilation.
The /M modifier causes the size of the memory block used to hold
the compiled pattern to be output.
The /S modifier causes pcre_study() to be called after the expression has been
compiled, and the results used when the expression is matched.
The /P modifier causes pcretest to call PCRE via the POSIX
wrapper API rather than its native API. When this is done,
all other modifiers except /i, /m, and /+ are ignored.
REG_ICASE is set if /i is present, and REG_NEWLINE is set if
/m is present. The wrapper functions force
PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless
REG_NEWLINE is set.
The /M modifier causes the size of the memory block used to hold the compiled
pattern to be output.
The /8 modifier causes pcretest to call PCRE with the
PCRE_UTF8 option set. This turns on the (currently incom-
plete) support for UTF-8 character handling in PCRE, pro-
vided that it was compiled with this support enabled. This
modifier also causes any non-printing characters in output
strings to be printed using the \x{hh...} notation if they
are valid UTF-8 sequences.
The /P modifier causes pcretest to call PCRE via the POSIX wrapper API rather
than its native API. When this is done, all other modifiers except /i, /m, and
/+ are ignored. REG_ICASE is set if /i is present, and REG_NEWLINE is set if /m
is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always, and
PCRE_DOTALL unless REG_NEWLINE is set.
The /8 modifier causes pcretest to call PCRE with the PCRE_UTF8 option set.
This turns on the (currently incomplete) support for UTF-8 character handling
in PCRE, provided that it was compiled with this support enabled. This modifier
also causes any non-printing characters in output strings to be printed using
the \x{hh...} notation if they are valid UTF-8 sequences.
DATA LINES
----------
Before each data line is passed to pcre_exec(), leading and
trailing whitespace is removed, and it is then scanned for \
escapes. The following are recognized:
Before each data line is passed to pcre_exec(), leading and trailing whitespace
is removed, and it is then scanned for \ escapes. The following are recognized:
\a alarm (= BEL)
\b backspace
\e escape
\f formfeed
\n newline
\r carriage return
\t tab
\v vertical tab
\nnn octal character (up to 3 octal digits)
\xhh hexadecimal character (up to 2 hex digits)
\x{hh...} hexadecimal UTF-8 character
\a alarm (= BEL)
\b backspace
\e escape
\f formfeed
\n newline
\r carriage return
\t tab
\v vertical tab
\nnn octal character (up to 3 octal digits)
\xhh hexadecimal character (up to 2 hex digits)
\x{hh...} hexadecimal UTF-8 character
\A pass the PCRE_ANCHORED option to pcre_exec()
\B pass the PCRE_NOTBOL option to pcre_exec()
\Cdd call pcre_copy_substring() for substring dd
after a successful match (any decimal number
less than 32)
\Gdd call pcre_get_substring() for substring dd
\A pass the PCRE_ANCHORED option to pcre_exec()
\B pass the PCRE_NOTBOL option to pcre_exec()
\Cdd call pcre_copy_substring() for substring dd after a successful
match (any decimal number less than 32)
\Gdd call pcre_get_substring() for substring dd after a successful
match (any decimal number less than 32)
\L call pcre_get_substringlist() after a successful match
\N pass the PCRE_NOTEMPTY option to pcre_exec()
\Odd set the size of the output vector passed to pcre_exec() to dd
(any number of decimal digits)
\Z pass the PCRE_NOTEOL option to pcre_exec()
after a successful match (any decimal number
less than 32)
\L call pcre_get_substringlist() after a
successful match
\N pass the PCRE_NOTEMPTY option to pcre_exec()
\Odd set the size of the output vector passed to
pcre_exec() to dd (any number of decimal
digits)
\Z pass the PCRE_NOTEOL option to pcre_exec()
A backslash followed by anything else just escapes the anything else. If the
very last character is a backslash, it is ignored. This gives a way of passing
an empty line as data, since a real empty line terminates the data input.
When \O is used, it may be higher or lower than the size set
by the -o option (or defaulted to 45); \O applies only to
the call of pcre_exec() for the line in which it appears.
If /P was present on the regex, causing the POSIX wrapper API to be used, only
\B and \Z have any effect, causing REG_NOTBOL and REG_NOTEOL to be passed to
regexec() respectively.
A backslash followed by anything else just escapes the any-
thing else. If the very last character is a backslash, it is
ignored. This gives a way of passing an empty line as data,
since a real empty line terminates the data input.
If /P was present on the regex, causing the POSIX wrapper
API to be used, only \B and \Z have any effect, causing
REG_NOTBOL and REG_NOTEOL to be passed to regexec() respec-
tively.
The use of \x{hh...} to represent UTF-8 characters is not
dependent on the use of the /8 modifier on the pattern. It
is recognized always. There may be any number of hexadecimal
digits inside the braces. The result is from one to six
bytes, encoded according to the UTF-8 rules.
The use of \x{hh...} to represent UTF-8 characters is not dependent on the use
of the /8 modifier on the pattern. It is recognized always. There may be any
number of hexadecimal digits inside the braces. The result is from one to six
bytes, encoded according to the UTF-8 rules.
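
As an illustration of the one-to-six-byte encoding, here is a minimal sketch of such an encoder. The name encode_utf8() and its tables are assumptions made for this example; it follows the original (pre-2003) UTF-8 rules that allow values up to 0x7fffffff, and it is not the code used by pcretest.

```c
#include <assert.h>

/* Encode value (0 .. 0x7fffffff) as UTF-8 into buf, which must have room
   for at least 6 bytes. Returns the number of bytes written (1 to 6). */
int encode_utf8(unsigned long value, unsigned char *buf)
{
  /* Largest value that fits in 1, 2, 3, 4 and 5 bytes respectively. */
  static const unsigned long limit[] =
    { 0x7f, 0x7ff, 0xffff, 0x1fffff, 0x3ffffff };
  /* Marker bits for the first byte, indexed by number of extra bytes. */
  static const unsigned char lead[] =
    { 0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc };
  int extra, j;

  assert(value <= 0x7fffffffUL);
  for (extra = 0; extra < 5; extra++)
    if (value <= limit[extra]) break;

  /* Fill the continuation bytes from the last one backwards, taking the
     low-order six bits each time. */
  for (j = extra; j > 0; j--)
    {
    buf[j] = (unsigned char)(0x80 | (value & 0x3f));
    value >>= 6;
    }

  /* Whatever is left goes into the first byte with its marker bits. */
  buf[0] = (unsigned char)(lead[extra] | value);
  return extra + 1;
}
```

Under these assumptions \x{e9} becomes the two bytes c3 a9, and \x{20ac} becomes the three bytes e2 82 ac.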
OUTPUT FROM PCRETEST
--------------------
When a match succeeds, pcretest outputs the list of captured
substrings that pcre_exec() returns, starting with number 0
for the string that matched the whole pattern. Here is an
example of an interactive pcretest run.
When a match succeeds, pcretest outputs the list of captured substrings that
pcre_exec() returns, starting with number 0 for the string that matched the
whole pattern. Here is an example of an interactive pcretest run.
$ pcretest
PCRE version 2.06 08-Jun-1999
$ pcretest
PCRE version 2.06 08-Jun-1999
re> /^abc(\d+)/
data> abc123
0: abc123
1: 123
data> xyz
No match
re> /^abc(\d+)/
data> abc123
0: abc123
1: 123
data> xyz
No match
If the strings contain any non-printing characters, they are
output as \0x escapes, or as \x{...} escapes if the /8
modifier was present on the pattern. If the pattern has the
/+ modifier, then the output for substring 0 is followed by
the rest of the subject string, identified by "0+" like
this:
If the strings contain any non-printing characters, they are output as \0x
escapes, or as \x{...} escapes if the /8 modifier was present on the pattern.
If the pattern has the /+ modifier, then the output for substring 0 is followed
by the rest of the subject string, identified by "0+" like this:
re> /cat/+
data> cataract
0: cat
0+ aract
re> /cat/+
data> cataract
0: cat
0+ aract
If the pattern has the /g or /G modifier, the results of
successive matching attempts are output in sequence, like
this:
If the pattern has the /g or /G modifier, the results of successive matching
attempts are output in sequence, like this:
re> /\Bi(\w\w)/g
data> Mississippi
0: iss
1: ss
0: iss
1: ss
0: ipp
1: pp
re> /\Bi(\w\w)/g
data> Mississippi
0: iss
1: ss
0: iss
1: ss
0: ipp
1: pp
"No match" is output only if the first match attempt fails.
"No match" is output only if the first match attempt fails.
If any of the sequences \C, \G, or \L are present in a data
line that is successfully matched, the substrings extracted
by the convenience functions are output with C, G, or L
after the string number instead of a colon. This is in addi-
tion to the normal full list. The string length (that is,
the return from the extraction function) is given in
parentheses after each string for \C and \G.
If any of \C, \G, or \L are present in a data line that is successfully
matched, the substrings extracted by the convenience functions are output with
C, G, or L after the string number instead of a colon. This is in addition to
the normal full list. The string length (that is, the return from the
extraction function) is given in parentheses after each string for \C and \G.
Note that while patterns can be continued over several lines (a plain ">"
prompt is used for continuations), data lines may not. However, newlines can be
included in data by means of the \n escape.
Note that while patterns can be continued over several lines
(a plain ">" prompt is used for continuations), data lines
may not. However, newlines can be included in data by means
of the \n escape.
COMMAND LINE OPTIONS
--------------------
If the -p option is given to pcretest, it is equivalent to adding /P to each
regular expression: the POSIX wrapper API is used to call PCRE. None of the
following flags has any effect in this case.
AUTHOR
Philip Hazel <ph10@cam.ac.uk>
University Computing Service,
New Museums Site,
Cambridge CB2 3QG, England.
Phone: +44 1223 334714
If the option -d is given to pcretest, it is equivalent to adding /D to each
regular expression: the internal form is output after compilation.
If the option -i is given to pcretest, it is equivalent to adding /I to each
regular expression: information about the compiled pattern is given after
compilation.
If the option -m is given to pcretest, it outputs the size of each compiled
pattern after it has been compiled. It is equivalent to adding /M to each
regular expression. For compatibility with earlier versions of pcretest, -s is
a synonym for -m.
If the -t option is given, each compile, study, and match is run 20000 times
while being timed, and the resulting time per compile or match is output in
milliseconds. Do not set -t with -m, because you will then get the size output
20000 times and the timing will be distorted. If you want to change the number
of repetitions used for timing, edit the definition of LOOPREPEAT at the top of
pcretest.c.
Philip Hazel <ph10@cam.ac.uk>
August 2000
Last updated: 15 August 2001
Copyright (c) 1997-2001 University of Cambridge.

View File

@ -9,7 +9,7 @@ the file Tech.Notes for some information on the internals.
Written by: Philip Hazel <ph10@cam.ac.uk>
Copyright (c) 1997-2000 University of Cambridge
Copyright (c) 1997-2001 University of Cambridge
-----------------------------------------------------------------------------
Permission is granted to anyone to use this software for any purpose on any

View File

@ -9,7 +9,7 @@ the file Tech.Notes for some information on the internals.
Written by: Philip Hazel <ph10@cam.ac.uk>
Copyright (c) 1997-2000 University of Cambridge
Copyright (c) 1997-2001 University of Cambridge
-----------------------------------------------------------------------------
Permission is granted to anyone to use this software for any purpose on any
@ -38,13 +38,21 @@ modules, but which are not relevant to the outside. */
/* Get the definitions provided by running "configure" */
#ifdef PHP_WIN32
#include "config.w32.h"
# include "config.w32.h"
#elif defined(NETWARE)
#include "config.nw.h"
# include "config.nw.h"
#else
#include "php_config.h"
# include "php_config.h"
#endif
/* The value of NEWLINE determines the newline character. The default is to
leave it up to the compiler, but some sites want to force a particular value.
On Unix systems, "configure" can be used to override this default. */
#ifndef NEWLINE
#define NEWLINE '\n'
#endif
/* To cope with SunOS4 and other systems that lack memmove() but have bcopy(),
define a macro for memmove() if HAVE_MEMMOVE is false, provided that HAVE_BCOPY
is set. Otherwise, include an emulating function for those systems that have
@ -129,12 +137,36 @@ typedef int BOOL;
#define FALSE 0
#define TRUE 1
/* Escape items that are just an encoding of a particular data value. Note that
ESC_N is defined as yet another macro, which is set in config.h to either \n
(the default) or \r (which some people want). */
#ifndef ESC_E
#define ESC_E 27
#endif
#ifndef ESC_F
#define ESC_F '\f'
#endif
#ifndef ESC_N
#define ESC_N NEWLINE
#endif
#ifndef ESC_R
#define ESC_R '\r'
#endif
#ifndef ESC_T
#define ESC_T '\t'
#endif
/* These are escaped items that aren't just an encoding of a particular data
value such as \n. They must have non-zero values, as check_escape() returns
their negation. Also, they must appear in the same order as in the opcode
definitions below, up to ESC_z. The final one must be ESC_REF as subsequent
values are used for \1, \2, \3, etc. There is a test in the code for an escape
greater than ESC_b and less than ESC_X to detect the types that may be
greater than ESC_b and less than ESC_Z to detect the types that may be
repeated. If any new escapes are put in-between that don't consume a character,
that code will have to change. */
@ -230,19 +262,26 @@ enum {
OP_ONCE, /* Once matched, don't back up into the subpattern */
OP_COND, /* Conditional group */
OP_CREF, /* Used to hold an extraction string number */
OP_CREF, /* Used to hold an extraction string number (cond ref) */
OP_BRAZERO, /* These two must remain together and in this */
OP_BRAMINZERO, /* order. */
OP_BRANUMBER, /* Used for extracting brackets whose number is greater
than can fit into an opcode. */
OP_BRA /* This and greater values are used for brackets that
extract substrings. */
extract substrings up to a basic limit. After that,
use is made of OP_BRANUMBER. */
};
/* The highest extraction number. This is limited by the number of opcodes
left after OP_BRA, i.e. 255 - OP_BRA. We actually set it somewhat lower. */
/* The highest extraction number before we have to start using additional
bytes. (Originally PCRE didn't have support for extraction counts higher than
this number.) The value is limited by the number of opcodes left after OP_BRA,
i.e. 255 - OP_BRA. We actually set it a bit lower to leave room for additional
opcodes. */
#define EXTRACT_MAX 99
#define EXTRACT_BASIC_MAX 150
/* The texts of compile-time error messages are defined as macros here so that
they can be accessed by the POSIX wrapper and converted into error codes. Yes,
@ -261,13 +300,13 @@ just to accommodate the POSIX wrapper. */
#define ERR10 "operand of unlimited repeat could match the empty string"
#define ERR11 "internal error: unexpected repeat"
#define ERR12 "unrecognized character after (?"
#define ERR13 "too many capturing parenthesized sub-patterns"
#define ERR13 "unused error"
#define ERR14 "missing )"
#define ERR15 "back reference to non-existent subpattern"
#define ERR16 "erroffset passed as NULL"
#define ERR17 "unknown option bit(s) set"
#define ERR18 "missing ) after comment"
#define ERR19 "too many sets of parentheses"
#define ERR19 "parentheses nested too deeply"
#define ERR20 "regular expression too large"
#define ERR21 "failed to get memory"
#define ERR22 "unmatched parentheses"
@ -302,8 +341,8 @@ typedef struct real_pcre {
size_t size;
const unsigned char *tables;
unsigned long int options;
uschar top_bracket;
uschar top_backref;
unsigned short int top_bracket;
unsigned short int top_backref;
uschar first_char;
uschar req_char;
uschar code[1];
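
The OP_BRANUMBER scheme introduced in this header can be sketched as a pair of helper functions. The opcode values and the helper names below are hypothetical and chosen only for illustration (the real values come from the opcode enum). The sketch assumes the layout used in the compile-time change later in this commit: the opcode at code[0], two link bytes at code[1] and code[2], and the optional OP_BRANUMBER item at code[3] to code[5].

```c
#include <assert.h>

#define SK_OP_BRANUMBER  99   /* hypothetical opcode value */
#define SK_OP_BRA       100   /* hypothetical opcode value */
#define SK_BASIC_MAX    150   /* mirrors EXTRACT_BASIC_MAX */

/* Store an extracting bracket number. Small numbers are folded into the
   opcode itself; larger ones use the opcode one above the basic limit and
   put the true number, high byte first, after an OP_BRANUMBER item.
   Returns the number of extra bytes used (0 or 3). */
int put_bracket(unsigned char *code, int number)
{
  assert(number > 0 && number <= 0xffff);
  if (number > SK_BASIC_MAX)
    {
    code[0] = (unsigned char)(SK_OP_BRA + SK_BASIC_MAX + 1);
    code[3] = SK_OP_BRANUMBER;
    code[4] = (unsigned char)(number >> 8);
    code[5] = (unsigned char)(number & 255);
    return 3;
    }
  code[0] = (unsigned char)(SK_OP_BRA + number);
  return 0;
}

/* Recover the bracket number from the compiled form written above. */
int get_bracket(const unsigned char *code)
{
  int number = code[0] - SK_OP_BRA;
  if (number > SK_BASIC_MAX) number = (code[4] << 8) | code[5];
  return number;
}
```

The two-byte form is also why top_bracket and top_backref are widened from uschar to unsigned short int in the struct above: a single byte can no longer hold the highest bracket number.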

View File

@ -8,7 +8,7 @@ and semantics are as close as possible to those of the Perl 5 language.
Written by: Philip Hazel <ph10@cam.ac.uk>
Copyright (c) 1997-2000 University of Cambridge
Copyright (c) 1997-2001 University of Cambridge
-----------------------------------------------------------------------------
Permission is granted to anyone to use this software for any purpose on any
@ -58,7 +58,7 @@ Arguments: none
Returns: pointer to the contiguous block of data
*/
unsigned const char *
const unsigned char *
pcre_maketables(void)
{
unsigned char *yield, *p;

View File

@ -9,7 +9,7 @@ the file Tech.Notes for some information on the internals.
Written by: Philip Hazel <ph10@cam.ac.uk>
Copyright (c) 1997-2000 University of Cambridge
Copyright (c) 1997-2001 University of Cambridge
-----------------------------------------------------------------------------
Permission is granted to anyone to use this software for any purpose on any
@ -60,8 +60,11 @@ the external pcre header. */
#endif
/* Number of items on the nested bracket stacks at compile time. This should
not be set greater than 200. */
/* Maximum number of items on the nested bracket stacks at compile time. This
applies to the nesting of all kinds of parentheses. It does not limit
un-nested, non-capturing parentheses. This number can be made bigger if
necessary - it is used to dimension one int and one unsigned char vector at
compile time. */
#define BRASTACK_SIZE 200
@ -95,7 +98,7 @@ static const char *OP_names[] = {
"class", "Ref", "Recurse",
"Alt", "Ket", "KetRmax", "KetRmin", "Assert", "Assert not",
"AssertB", "AssertB not", "Reverse", "Once", "Cond", "Cref",
"Brazero", "Braminzero", "Bra"
"Brazero", "Braminzero", "Branumber", "Bra"
};
#endif
@ -111,9 +114,9 @@ static const short int escapes[] = {
0, 0, 0, 0, 0, 0, 0, 0, /* H - O */
0, 0, 0, -ESC_S, 0, 0, 0, -ESC_W, /* P - W */
0, 0, -ESC_Z, '[', '\\', ']', '^', '_', /* X - _ */
'`', 7, -ESC_b, 0, -ESC_d, 27, '\f', 0, /* ` - g */
0, 0, 0, 0, 0, 0, '\n', 0, /* h - o */
0, 0, '\r', -ESC_s, '\t', 0, 0, -ESC_w, /* p - w */
'`', 7, -ESC_b, 0, -ESC_d, ESC_E, ESC_F, 0, /* ` - g */
0, 0, 0, 0, 0, 0, ESC_N, 0, /* h - o */
0, 0, ESC_R, -ESC_s, ESC_T, 0, 0, -ESC_w, /* p - w */
0, 0, -ESC_z /* x - z */
};
@ -208,12 +211,12 @@ byte-mode, and more complicated ones for UTF-8 characters. */
if (md->utf8 && (c & 0xc0) == 0xc0) \
{ \
int a = utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
int s = 6 - a; /* Amount to shift next byte */ \
c &= utf8_table3[a]; /* Low order bits from first byte */ \
int s = 6*a; \
c = (c & utf8_table3[a]) << s; \
while (a-- > 0) \
{ \
s -= 6; \
c |= (*eptr++ & 0x3f) << s; \
s += 6; \
} \
}
@ -226,12 +229,12 @@ byte-mode, and more complicated ones for UTF-8 characters. */
{ \
int i; \
int a = utf8_table4[c & 0x3f]; /* Number of additional bytes */ \
int s = 6 - a; /* Amount to shift next byte */ \
c &= utf8_table3[a]; /* Low order bits from first byte */ \
int s = 6*a; \
c = (c & utf8_table3[a]) << s; \
for (i = 1; i <= a; i++) \
{ \
s -= 6; \
c |= (eptr[i] & 0x3f) << s; \
s += 6; \
} \
len += a; \
}
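
The corrected shift order in the two macros above can be illustrated with a standalone decoder. This is a sketch under stated assumptions: decode_utf8() is a hypothetical name, the lead byte is analysed by counting bits rather than through utf8_table3 and utf8_table4, and, like the macros, it does not validate the continuation bytes.

```c
#include <assert.h>

/* Decode one UTF-8 character starting at p. Stores the value in *value and
   returns the number of bytes consumed (1 to 6). */
int decode_utf8(const unsigned char *p, unsigned long *value)
{
  unsigned long c = p[0];
  int extra = 0;

  assert(p != 0 && value != 0);
  if ((c & 0xc0) == 0xc0)
    {
    int i, shift;
    unsigned int mask = 0x20;

    /* Each further 1 bit after the leading "11" means one more byte. */
    extra = 1;
    while (extra < 5 && (c & mask) != 0) { extra++; mask >>= 1; }

    /* The bits from the first byte are the most significant, so they are
       shifted up first; each following byte then fills the next lower
       group of six bits. */
    shift = 6 * extra;
    c = (c & (0x3fu >> extra)) << shift;
    for (i = 1; i <= extra; i++)
      {
      shift -= 6;
      c |= (unsigned long)(p[i] & 0x3f) << shift;
      }
    }

  *value = c;
  return extra + 1;
}
```

Starting the shift at 6*a and counting down is what puts the bytes in the right order; the earlier code started low and counted up.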
@ -306,13 +309,13 @@ ord2utf8(int cvalue, uschar *buffer)
register int i, j;
for (i = 0; i < sizeof(utf8_table1)/sizeof(int); i++)
if (cvalue <= utf8_table1[i]) break;
*buffer++ = utf8_table2[i] | (cvalue & utf8_table3[i]);
cvalue >>= 6 - i;
for (j = 0; j < i; j++)
{
*buffer++ = 0x80 | (cvalue & 0x3f);
cvalue >>= 6;
}
buffer += i;
for (j = i; j > 0; j--)
{
*buffer-- = 0x80 | (cvalue & 0x3f);
cvalue >>= 6;
}
*buffer = utf8_table2[i] | cvalue;
return i + 1;
}
#endif
@ -814,10 +817,11 @@ for (;;)
/* Skip over things that don't match chars */
case OP_REVERSE:
case OP_BRANUMBER:
case OP_CREF:
cc++;
/* Fall through */
case OP_CREF:
case OP_OPT:
cc++;
/* Fall through */
@ -871,7 +875,7 @@ for (;;)
/* Check a class for variable quantification */
case OP_CLASS:
cc += (*cc == OP_REF)? 2 : 33;
cc += 33;
switch (*cc)
{
@ -978,7 +982,7 @@ return -1;
Arguments:
options the option bits
brackets points to number of brackets used
brackets points to number of extracting brackets used
code points to the pointer to the current code point
ptrptr points to the current pattern pointer
errorptr points to pointer to error message
@ -1029,7 +1033,7 @@ for (;; ptr++)
int class_charcount;
int class_lastchar;
int newoptions;
int condref;
int skipbytes;
int subreqchar;
c = *ptr;
@ -1040,7 +1044,7 @@ for (;; ptr++)
{
/* The space before the ; is to avoid a warning on a silly compiler
on the Macintosh. */
while ((c = *(++ptr)) != 0 && c != '\n') ;
while ((c = *(++ptr)) != 0 && c != NEWLINE) ;
continue;
}
}
@ -1578,7 +1582,7 @@ for (;; ptr++)
OP_BRAZERO in front of it, and because the group appears once in the
data, whereas in other cases it appears the minimum number of times. For
this reason, it is simplest to treat this case separately, as otherwise
the code gets far too mess. There are several special subcases when the
the code gets far too messy. There are several special subcases when the
minimum is zero. */
if (repeat_min == 0)
@ -1729,7 +1733,7 @@ for (;; ptr++)
case '(':
newoptions = options;
condref = -1;
skipbytes = 0;
if (*(++ptr) == '?')
{
@ -1752,7 +1756,7 @@ for (;; ptr++)
bravalue = OP_COND; /* Conditional group */
if ((cd->ctypes[*(++ptr)] & ctype_digit) != 0)
{
condref = *ptr - '0';
int condref = *ptr - '0';
while (*(++ptr) != ')') condref = condref*10 + *ptr - '0';
if (condref == 0)
{
@ -1760,6 +1764,10 @@ for (;; ptr++)
goto FAILED;
}
ptr++;
code[3] = OP_CREF;
code[4] = condref >> 8;
code[5] = condref & 255;
skipbytes = 3;
}
else ptr--;
break;
@ -1862,16 +1870,21 @@ for (;; ptr++)
}
}
/* Else we have a referencing group; adjust the opcode. */
/* Else we have a referencing group; adjust the opcode. If the bracket
number is greater than EXTRACT_BASIC_MAX, we set the opcode one higher, and
arrange for the true number to follow later, in an OP_BRANUMBER item. */
else
{
if (++(*brackets) > EXTRACT_MAX)
if (++(*brackets) > EXTRACT_BASIC_MAX)
{
*errorptr = ERR13;
goto FAILED;
bravalue = OP_BRA + EXTRACT_BASIC_MAX + 1;
code[3] = OP_BRANUMBER;
code[4] = *brackets >> 8;
code[5] = *brackets & 255;
skipbytes = 3;
}
bravalue = OP_BRA + *brackets;
else bravalue = OP_BRA + *brackets;
}
/* Process nested bracketed re. Assertions may not be repeated, but other
@ -1887,13 +1900,13 @@ for (;; ptr++)
options | PCRE_INGROUP, /* Set for all nested groups */
((options & PCRE_IMS) != (newoptions & PCRE_IMS))?
newoptions & PCRE_IMS : -1, /* Pass ims options if changed */
brackets, /* Bracket level */
brackets, /* Extracting bracket count */
&tempcode, /* Where to put code (updated) */
&ptr, /* Input pointer (updated) */
errorptr, /* Where to put an error message */
(bravalue == OP_ASSERTBACK ||
bravalue == OP_ASSERTBACK_NOT), /* TRUE if back assert */
condref, /* Condition reference number */
skipbytes, /* Skip over OP_COND/OP_BRANUMBER */
&subreqchar, /* For possible last char */
&subcountlits, /* For literal count */
cd)) /* Tables block */
@ -1907,7 +1920,7 @@ for (;; ptr++)
/* If this is a conditional bracket, check that there are no more than
two branches in the group. */
if (bravalue == OP_COND)
else if (bravalue == OP_COND)
{
uschar *tc = code;
condcount = 0;
@ -1974,9 +1987,11 @@ for (;; ptr++)
{
if (-c >= ESC_REF)
{
int number = -c - ESC_REF;
previous = code;
*code++ = OP_REF;
*code++ = -c - ESC_REF;
*code++ = number >> 8;
*code++ = number & 255;
}
else
{
@ -2011,7 +2026,7 @@ for (;; ptr++)
{
/* The space before the ; is to avoid a warning on a silly compiler
on the Macintosh. */
while ((c = *(++ptr)) != 0 && c != '\n') ;
while ((c = *(++ptr)) != 0 && c != NEWLINE) ;
if (c == 0) break;
continue;
}
@ -2100,7 +2115,7 @@ Argument:
ptrptr -> the address of the current pattern pointer
errorptr -> pointer to error message
lookbehind TRUE if this is a lookbehind assertion
condref >= 0 for OPT_CREF setting at start of conditional group
skipbytes skip this many bytes at start (for OP_COND, OP_BRANUMBER)
reqchar -> place to put the last required character, or a negative number
countlits -> place to put the shortest literal count of any branch
cd points to the data block with tables pointers
@ -2110,7 +2125,7 @@ Returns: TRUE on success
static BOOL
compile_regex(int options, int optchanged, int *brackets, uschar **codeptr,
const uschar **ptrptr, const char **errorptr, BOOL lookbehind, int condref,
const uschar **ptrptr, const char **errorptr, BOOL lookbehind, int skipbytes,
int *reqchar, int *countlits, compile_data *cd)
{
const uschar *ptr = *ptrptr;
@ -2123,16 +2138,7 @@ int branchreqchar, branchcountlits;
*reqchar = -1;
*countlits = INT_MAX;
code += 3;
/* At the start of a reference-based conditional group, insert the reference
number as an OP_CREF item. */
if (condref >= 0)
{
*code++ = OP_CREF;
*code++ = condref;
}
code += 3 + skipbytes;
/* Loop for each alternative branch */
@ -2284,7 +2290,8 @@ for (;;)
break;
case OP_CREF:
code += 2;
case OP_BRANUMBER:
code += 3;
break;
case OP_WORD_BOUNDARY:
@ -2547,6 +2554,7 @@ while ((c = *(++ptr)) != 0)
{
int min, max;
int class_charcount;
int bracket_length;
if ((options & PCRE_EXTENDED) != 0)
{
@ -2555,7 +2563,7 @@ while ((c = *(++ptr)) != 0)
{
/* The space before the ; is to avoid a warning on a silly compiler
on the Macintosh. */
while ((c = *(++ptr)) != 0 && c != '\n') ;
while ((c = *(++ptr)) != 0 && c != NEWLINE) ;
continue;
}
}
@ -2581,7 +2589,7 @@ while ((c = *(++ptr)) != 0)
}
length++;
/* A back reference needs an additional char, plus either one or 5
/* A back reference needs an additional 2 bytes, plus either one or 5
bytes for a repeat. We also need to keep the value of the highest
back reference. */
@ -2589,7 +2597,7 @@ while ((c = *(++ptr)) != 0)
{
int refnum = -c - ESC_REF;
if (refnum > top_backref) top_backref = refnum;
length++; /* For single back reference */
length += 2; /* For single back reference */
if (ptr[1] == '{' && is_counted_repeat(ptr+2, &compile_block))
{
ptr = read_repeat_counts(ptr+2, &min, &max, errorptr, &compile_block);
@ -2687,6 +2695,7 @@ while ((c = *(++ptr)) != 0)
case '(':
branch_newextra = 0;
bracket_length = 3;
/* Handle special forms of bracket, which all start (? */
@ -2754,7 +2763,7 @@ while ((c = *(++ptr)) != 0)
if ((compile_block.ctypes[ptr[3]] & ctype_digit) != 0)
{
ptr += 4;
length += 2;
length += 3;
while ((compile_block.ctypes[*ptr] & ctype_digit) != 0) ptr++;
if (*ptr != ')')
{
@ -2881,15 +2890,19 @@ while ((c = *(++ptr)) != 0)
}
/* Extracting brackets must be counted so we can process escapes in a
Perlish way. */
Perlish way. If the number exceeds EXTRACT_BASIC_MAX we are going to
need an additional 3 bytes of store per extracting bracket. */
else bracount++;
else
{
bracount++;
if (bracount > EXTRACT_BASIC_MAX) bracket_length += 3;
}
/* Non-special forms of bracket. Save length for computing whole length
at end if there's a repeat that requires duplication of the group. Also
save the current value of branch_extra, and start the new group with
the new value. If non-zero, this will either be 2 for a (?imsx: group, or 3
for a lookbehind assertion. */
/* Save length for computing whole length at end if there's a repeat that
requires duplication of the group. Also save the current value of
branch_extra, and start the new group with the new value. If non-zero, this
will either be 2 for a (?imsx: group, or 3 for a lookbehind assertion. */
if (brastackptr >= sizeof(brastack)/sizeof(int))
{
@ -2901,7 +2914,7 @@ while ((c = *(++ptr)) != 0)
branch_extra = branch_newextra;
brastack[brastackptr++] = length;
length += 3;
length += bracket_length;
continue;
/* Handle ket. Look for subsequent max/min; for certain sets of values we
@ -2981,7 +2994,7 @@ while ((c = *(++ptr)) != 0)
{
/* The space before the ; is to avoid a warning on a silly compiler
on the Macintosh. */
while ((c = *(++ptr)) != 0 && c != '\n') ;
while ((c = *(++ptr)) != 0 && c != NEWLINE) ;
continue;
}
}
@ -3062,7 +3075,7 @@ ptr = (const uschar *)pattern;
code = re->code;
*code = OP_BRA;
bracount = 0;
(void)compile_regex(options, -1, &bracount, &code, &ptr, errorptr, FALSE, -1,
(void)compile_regex(options, -1, &bracount, &code, &ptr, errorptr, FALSE, 0,
&reqchar, &countlits, &compile_block);
re->top_bracket = bracount;
re->top_backref = top_backref;
@ -3176,7 +3189,10 @@ while (code < code_end)
if (*code >= OP_BRA)
{
printf("%3d Bra %d", (code[1] << 8) + code[2], *code - OP_BRA);
if (*code - OP_BRA > EXTRACT_BASIC_MAX)
printf("%3d Bra extra", (code[1] << 8) + code[2]);
else
printf("%3d Bra %d", (code[1] << 8) + code[2], *code - OP_BRA);
code += 2;
}
@ -3187,16 +3203,6 @@ while (code < code_end)
code++;
break;
case OP_COND:
printf("%3d Cond", (code[1] << 8) + code[2]);
code += 2;
break;
case OP_CREF:
printf(" %.2d %s", code[1], OP_names[*code]);
code++;
break;
case OP_CHARS:
charlength = *(++code);
printf("%3d ", charlength);
@ -3213,11 +3219,10 @@ while (code < code_end)
case OP_ASSERTBACK:
case OP_ASSERTBACK_NOT:
case OP_ONCE:
printf("%3d %s", (code[1] << 8) + code[2], OP_names[*code]);
code += 2;
break;
case OP_REVERSE:
case OP_BRANUMBER:
case OP_COND:
case OP_CREF:
printf("%3d %s", (code[1] << 8) + code[2], OP_names[*code]);
code += 2;
break;
@ -3290,8 +3295,8 @@ while (code < code_end)
break;
case OP_REF:
printf(" \\%d", *(++code));
code ++;
printf(" \\%d", (code[1] << 8) | code[2]);
code += 3;
goto CLASS_REF_REPEAT;
case OP_CLASS:
@ -3504,8 +3509,14 @@ for (;;)
if (op > OP_BRA)
{
int offset;
int number = op - OP_BRA;
int offset = number << 1;
/* For extended extraction brackets (large number), we have to fish out the
number from a dummy opcode at the start. */
if (number > EXTRACT_BASIC_MAX) number = (ecode[4] << 8) | ecode[5];
offset = number << 1;
#ifdef DEBUG
printf("start bracket %d subject=", number);
@ -3535,6 +3546,7 @@ for (;;)
md->offset_vector[offset] = save_offset1;
md->offset_vector[offset+1] = save_offset2;
md->offset_vector[md->offset_end - number] = save_offset3;
return FALSE;
}
@ -3567,10 +3579,10 @@ for (;;)
case OP_COND:
if (ecode[3] == OP_CREF) /* Condition is extraction test */
{
int offset = ecode[4] << 1; /* Doubled reference number */
int offset = (ecode[4] << 9) | (ecode[5] << 1); /* Doubled ref number */
return match(eptr,
ecode + ((offset < offset_top && md->offset_vector[offset] >= 0)?
5 : 3 + (ecode[1] << 8) + ecode[2]),
6 : 3 + (ecode[1] << 8) + ecode[2]),
offset_top, md, ims, eptrb, match_isgroup);
}
@ -3590,10 +3602,12 @@ for (;;)
}
/* Control never reaches here */
/* Skip over conditional reference data if encountered (should not be) */
/* Skip over conditional reference or large extraction number data if
encountered. */
case OP_CREF:
ecode += 2;
case OP_BRANUMBER:
ecode += 3;
break;
/* End of the pattern. If PCRE_NOTEMPTY is set, fail if we have matched
@ -3859,8 +3873,14 @@ for (;;)
if (*prev != OP_COND)
{
int offset;
int number = *prev - OP_BRA;
int offset = number << 1;
/* For extended extraction brackets (large number), we have to fish out
the number from a dummy opcode at the start. */
if (number > EXTRACT_BASIC_MAX) number = (prev[4] << 8) | prev[5];
offset = number << 1;
#ifdef DEBUG
printf("end bracket %d", number);
@ -3920,7 +3940,7 @@ for (;;)
if (md->notbol && eptr == md->start_subject) return FALSE;
if ((ims & PCRE_MULTILINE) != 0)
{
if (eptr != md->start_subject && eptr[-1] != '\n') return FALSE;
if (eptr != md->start_subject && eptr[-1] != NEWLINE) return FALSE;
ecode++;
break;
}
@ -3939,7 +3959,7 @@ for (;;)
case OP_DOLL:
if ((ims & PCRE_MULTILINE) != 0)
{
if (eptr < md->end_subject) { if (*eptr != '\n') return FALSE; }
if (eptr < md->end_subject) { if (*eptr != NEWLINE) return FALSE; }
else { if (md->noteol) return FALSE; }
ecode++;
break;
@ -3950,7 +3970,7 @@ for (;;)
if (!md->endonly)
{
if (eptr < md->end_subject - 1 ||
(eptr == md->end_subject - 1 && *eptr != '\n')) return FALSE;
(eptr == md->end_subject - 1 && *eptr != NEWLINE)) return FALSE;
ecode++;
break;
@ -3969,7 +3989,7 @@ for (;;)
case OP_EODN:
if (eptr < md->end_subject - 1 ||
(eptr == md->end_subject - 1 && *eptr != '\n')) return FALSE;
(eptr == md->end_subject - 1 && *eptr != NEWLINE)) return FALSE;
ecode++;
break;
@ -3991,7 +4011,7 @@ for (;;)
/* Match a single character type; inline for speed */
case OP_ANY:
if ((ims & PCRE_DOTALL) == 0 && eptr < md->end_subject && *eptr == '\n')
if ((ims & PCRE_DOTALL) == 0 && eptr < md->end_subject && *eptr == NEWLINE)
return FALSE;
if (eptr++ >= md->end_subject) return FALSE;
#ifdef SUPPORT_UTF8
@ -4054,8 +4074,8 @@ for (;;)
case OP_REF:
{
int length;
int offset = ecode[1] << 1; /* Doubled reference number */
ecode += 2; /* Advance past the item */
int offset = (ecode[1] << 9) | (ecode[2] << 1); /* Doubled ref number */
ecode += 3; /* Advance past item */
/* If the reference is unset, set the length to be longer than the amount
of subject left; this ensures that every attempt at a match fails. We
@ -4599,7 +4619,7 @@ for (;;)
for (i = 1; i <= min; i++)
{
if (eptr >= md->end_subject ||
(*eptr++ == '\n' && (ims & PCRE_DOTALL) == 0))
(*eptr++ == NEWLINE && (ims & PCRE_DOTALL) == 0))
return FALSE;
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
}
@ -4608,7 +4628,7 @@ for (;;)
#endif
/* Non-UTF8 can be faster */
if ((ims & PCRE_DOTALL) == 0)
{ for (i = 1; i <= min; i++) if (*eptr++ == '\n') return FALSE; }
{ for (i = 1; i <= min; i++) if (*eptr++ == NEWLINE) return FALSE; }
else eptr += min;
break;
@ -4663,7 +4683,7 @@ for (;;)
switch(ctype)
{
case OP_ANY:
if ((ims & PCRE_DOTALL) == 0 && c == '\n') return FALSE;
if ((ims & PCRE_DOTALL) == 0 && c == NEWLINE) return FALSE;
#ifdef SUPPORT_UTF8
if (md->utf8)
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
@ -4718,7 +4738,7 @@ for (;;)
{
for (i = min; i < max; i++)
{
if (eptr >= md->end_subject || *eptr++ == '\n') break;
if (eptr >= md->end_subject || *eptr++ == NEWLINE) break;
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
}
}
@ -4738,7 +4758,7 @@ for (;;)
{
for (i = min; i < max; i++)
{
if (eptr >= md->end_subject || *eptr == '\n') break;
if (eptr >= md->end_subject || *eptr == NEWLINE) break;
eptr++;
}
}
@ -4879,8 +4899,8 @@ const uschar *req_char_ptr = start_match - 1;
const real_pcre *re = (const real_pcre *)external_re;
const real_pcre_extra *extra = (const real_pcre_extra *)external_extra;
BOOL using_temporary_offsets = FALSE;
BOOL anchored = ((re->options | options) & PCRE_ANCHORED) != 0;
BOOL startline = (re->options & PCRE_STARTLINE) != 0;
BOOL anchored;
BOOL startline;
if ((options & ~PUBLIC_EXEC_OPTIONS) != 0) return PCRE_ERROR_BADOPTION;
@ -4888,6 +4908,9 @@ if (re == NULL || subject == NULL ||
(offsets == NULL && offsetcount > 0)) return PCRE_ERROR_NULL;
if (re->magic_number != MAGIC_NUMBER) return PCRE_ERROR_BADMAGIC;
anchored = ((re->options | options) & PCRE_ANCHORED) != 0;
startline = (re->options & PCRE_STARTLINE) != 0;
match_block.start_pattern = re->code;
match_block.start_subject = (const uschar *)subject;
match_block.end_subject = match_block.start_subject + length;
@ -5016,7 +5039,7 @@ do
{
if (start_match > match_block.start_subject + start_offset)
{
while (start_match < end_subject && start_match[-1] != '\n')
while (start_match < end_subject && start_match[-1] != NEWLINE)
start_match++;
}
}
@ -5121,7 +5144,7 @@ do
rc = match_block.offset_overflow? 0 : match_block.end_offset_top/2;
if (match_block.offset_end < 2) rc = 0; else
if (offsetcount < 2) rc = 0; else
{
offsets[0] = start_match - match_block.start_subject;
offsets[1] = match_block.end_match_ptr - match_block.start_subject;
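Note on the hunks above: back-reference and conditional-reference numbers grow from one byte to two, stored high byte first, and the matcher recovers the doubled offset-vector index with `(ecode[1] << 9) | (ecode[2] << 1)`. A minimal sketch of that round trip follows. The opcode value and the helper names `put_ref` and `ref_offset` are hypothetical and do not appear in PCRE.

```c
#include <assert.h>

typedef unsigned char uschar;

enum { OP_REF_DEMO = 1 };   /* stand-in opcode value, not PCRE's real OP_REF */

/* Emit an opcode followed by a 16-bit group number, high byte first.
   Returns the position just past the three bytes written. */
static uschar *put_ref(uschar *code, int number)
{
    assert(number >= 0 && number <= 0xffff);
    *code++ = OP_REF_DEMO;
    *code++ = (uschar)(number >> 8);
    *code++ = (uschar)(number & 255);
    return code;
}

/* Recover the doubled group number, i.e. the index of the group's start
   offset in an offsets vector laid out as (start, end) pairs. */
static int ref_offset(const uschar *ecode)
{
    return (ecode[1] << 9) | (ecode[2] << 1);
}
```

With a three-byte buffer, `put_ref(buf, 300)` writes the opcode, 1 and 44, and `ref_offset(buf)` then yields 600. Shifting the high byte by 9 rather than 8 folds the doubling into the decode step.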

View File

@ -2,7 +2,7 @@
* Perl-Compatible Regular Expressions *
*************************************************/
/* Copyright (c) 1997-2000 University of Cambridge */
/* Copyright (c) 1997-2001 University of Cambridge */
#ifndef _PCRE_H
#define _PCRE_H
@ -11,9 +11,9 @@
make changes to pcre.in. */
#include "php_compat.h"
#define PCRE_MAJOR 3
#define PCRE_MINOR 4
#define PCRE_DATE 22-Aug-2000
#define PCRE_MAJOR 3
#define PCRE_MINOR 9
#define PCRE_DATE 02-Jan-2002
/* Win32 uses DLL by default */
@ -75,8 +75,11 @@ extern "C" {
/* Types */
typedef void pcre;
typedef void pcre_extra;
struct real_pcre; /* declaration; the definition is private */
struct real_pcre_extra; /* declaration; the definition is private */
typedef struct real_pcre pcre;
typedef struct real_pcre_extra pcre_extra;
/* Store get and free functions. These can be set to alternative malloc/free
functions if required. Some magic is required for Win32 DLL; it is null on
@ -100,7 +103,7 @@ extern int pcre_get_substring(const char *, int *, int, int, const char **);
extern int pcre_get_substring_list(const char *, int *, int, const char ***);
extern int pcre_info(const pcre *, int *, int *);
extern int pcre_fullinfo(const pcre *, const pcre_extra *, int, void *);
extern unsigned const char *pcre_maketables(void);
extern const unsigned char *pcre_maketables(void);
extern pcre_extra *pcre_study(const pcre *, int, const char **);
extern const char *pcre_version(void);
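Note on the type change above: replacing `typedef void pcre` with an incomplete struct type keeps the layout private while letting the compiler reject a pointer of the wrong kind, which a `void *` would accept silently. A minimal sketch of the pattern, using made-up names (`demo_re`, `demo_compile`, `demo_top_bracket`) and not PCRE's actual definitions:

```c
#include <assert.h>
#include <stdlib.h>

/* Public header view: the struct is declared but not defined here, so
   callers can hold a demo_re pointer without seeing its fields. */
struct demo_re;
typedef struct demo_re demo_re;

/* Private view, normally kept in the library's own source file. */
struct demo_re {
    unsigned long magic;   /* illustrative tag only */
    int top_bracket;
};

/* Allocate and fill a compiled-pattern stand-in; NULL on failure. */
static demo_re *demo_compile(int brackets)
{
    demo_re *re = malloc(sizeof(*re));
    if (re == NULL) return NULL;
    re->magic = 0x44454D4FUL;   /* "DEMO", a made-up value */
    re->top_bracket = brackets;
    return re;
}

/* Accessor: the only way client code reads the private field. */
static int demo_top_bracket(const demo_re *re)
{
    assert(re != NULL);
    return re->top_bracket;
}
```

Client code that only sees the first two declarations can pass `demo_re *` values around but cannot dereference them, so the library stays free to change the struct.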

View File

@ -3,8 +3,9 @@
*************************************************/
/* This is a grep program that uses the PCRE regular expression library to do
its pattern matching. */
its pattern matching. On a Unix system it can recurse into directories. */
#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
@ -17,22 +18,122 @@ its pattern matching. */
typedef int BOOL;
#define VERSION "2.0 01-Aug-2001"
#define MAX_PATTERN_COUNT 100
/*************************************************
* Global variables *
*************************************************/
static pcre *pattern;
static pcre_extra *hints;
static char *pattern_filename = NULL;
static int pattern_count = 0;
static pcre **pattern_list;
static pcre_extra **hints_list;
static BOOL count_only = FALSE;
static BOOL filenames = TRUE;
static BOOL filenames_only = FALSE;
static BOOL invert = FALSE;
static BOOL number = FALSE;
static BOOL recurse = FALSE;
static BOOL silent = FALSE;
static BOOL whole_lines = FALSE;
/* Structure for options and list of them */
typedef struct option_item {
int one_char;
char *long_name;
char *help_text;
} option_item;
static option_item optionlist[] = {
{ -1, "help", "display this help and exit" },
{ 'c', "count", "print only a count of matching lines per FILE" },
{ 'h', "no-filename", "suppress the prefixing filename on output" },
{ 'i', "ignore-case", "ignore case distinctions" },
{ 'l', "files-with-matches", "print only FILE names containing matches" },
{ 'n', "line-number", "print line number with output lines" },
{ 'r', "recursive", "recursively scan sub-directories" },
{ 's', "no-messages", "suppress error messages" },
{ 'V', "version", "print version information and exit" },
{ 'v', "invert-match", "select non-matching lines" },
{ 'x', "line-regex", "force PATTERN to match only whole lines" },
{ 'x', "line-regexp", "force PATTERN to match only whole lines" },
{ 0, NULL, NULL }
};
/*************************************************
* Functions for directory scanning *
*************************************************/
/* These functions are defined so that they can be made system specific,
although at present the only ones are for Unix, and for "no directory recursion
support". */
/************* Directory scanning in Unix ***********/
#if IS_UNIX
#include <sys/types.h>
#include <sys/stat.h>
#include <dirent.h>
typedef DIR directory_type;
int
isdirectory(char *filename)
{
struct stat statbuf;
if (stat(filename, &statbuf) < 0)
return 0; /* In the expectation that opening as a file will fail */
return ((statbuf.st_mode & S_IFMT) == S_IFDIR)? '/' : 0;
}
directory_type *
opendirectory(char *filename)
{
return opendir(filename);
}
char *
readdirectory(directory_type *dir)
{
for (;;)
{
struct dirent *dent = readdir(dir);
if (dent == NULL) return NULL;
if (strcmp(dent->d_name, ".") != 0 && strcmp(dent->d_name, "..") != 0)
return dent->d_name;
}
return NULL; /* Keep compiler happy; never executed */
}
void
closedirectory(directory_type *dir)
{
closedir(dir);
}
#else
/************* Directory scanning when we can't do it ***********/
/* The type is void, and apart from isdirectory(), the functions do nothing. */
typedef void directory_type;
int isdirectory(char *filename) { return FALSE; }
directory_type * opendirectory(char *filename) { return NULL; }
char *readdirectory(directory_type *dir) { return NULL; }
void closedirectory(directory_type *dir) {}
#endif
#if ! HAVE_STRERROR
@ -72,13 +173,18 @@ char buffer[BUFSIZ];
while (fgets(buffer, sizeof(buffer), in) != NULL)
{
BOOL match;
BOOL match = FALSE;
int i;
int length = (int)strlen(buffer);
if (length > 0 && buffer[length-1] == '\n') buffer[--length] = 0;
linenumber++;
match = pcre_exec(pattern, hints, buffer, length, 0, 0, offsets, 99) >= 0;
if (match && whole_lines && offsets[1] != length) match = FALSE;
for (i = 0; !match && i < pattern_count; i++)
{
match = pcre_exec(pattern_list[i], hints_list[i], buffer, length, 0, 0,
offsets, 99) >= 0;
if (match && whole_lines && offsets[1] != length) match = FALSE;
}
if (match != invert)
{
@ -115,6 +221,65 @@ return rc;
/*************************************************
* Grep a file or recurse into a directory *
*************************************************/
static int
grep_or_recurse(char *filename, BOOL recurse, BOOL show_filenames,
BOOL only_one_at_top)
{
int rc = 1;
int sep;
FILE *in;
/* If the file is a directory and we are recursing, scan each file within it.
The scanning code is localized so it can be made system-specific. */
if ((sep = isdirectory(filename)) != 0 && recurse)
{
char buffer[1024];
char *nextfile;
directory_type *dir = opendirectory(filename);
if (dir == NULL)
{
fprintf(stderr, "pcregrep: Failed to open directory %s: %s\n", filename,
strerror(errno));
return 2;
}
while ((nextfile = readdirectory(dir)) != NULL)
{
int frc;
sprintf(buffer, "%.512s%c%.128s", filename, sep, nextfile);
frc = grep_or_recurse(buffer, recurse, TRUE, FALSE);
if (frc == 0 && rc == 1) rc = 0;
}
closedirectory(dir);
return rc;
}
/* If the file is not a directory, or we are not recursing, scan it. If this is
the first and only argument at top level, we don't show the file name.
Otherwise, control is via the show_filenames variable. */
in = fopen(filename, "r");
if (in == NULL)
{
fprintf(stderr, "pcregrep: Failed to open %s: %s\n", filename, strerror(errno));
return 2;
}
rc = pcregrep(in, (show_filenames && !only_one_at_top)? filename : NULL);
fclose(in);
return rc;
}
/*************************************************
* Usage function *
*************************************************/
@ -122,13 +287,89 @@ return rc;
static int
usage(int rc)
{
fprintf(stderr, "Usage: pcregrep [-Vchilnsvx] pattern [file] ...\n");
fprintf(stderr, "Usage: pcregrep [-Vcfhilnrsvx] [long-options] pattern [file] ...\n");
fprintf(stderr, "Type `pcregrep --help' for more information.\n");
return rc;
}
/*************************************************
* Help function *
*************************************************/
static void
help(void)
{
option_item *op;
printf("Usage: pcregrep [OPTION]... PATTERN [FILE] ...\n");
printf("Search for PATTERN in each FILE or standard input.\n");
printf("Example: pcregrep -i 'hello.*world' menu.h main.c\n\n");
printf("Options:\n");
for (op = optionlist; op->one_char != 0; op++)
{
int n;
char s[4];
if (op->one_char > 0) sprintf(s, "-%c,", op->one_char); else strcpy(s, " ");
printf(" %s --%s%n", s, op->long_name, &n);
n = 30 - n;
if (n < 1) n = 1;
printf("%.*s%s\n", n, " ", op->help_text);
}
printf("\n -f<filename> or --file=<filename>\n");
printf(" Read patterns from <filename> instead of using a command line option.\n");
printf(" Trailing white space is removed; blank lines are ignored.\n");
printf(" There is a maximum of %d patterns.\n", MAX_PATTERN_COUNT);
printf("\nWith no FILE, read standard input. If fewer than two FILEs given, assume -h.\n");
printf("Exit status is 0 if any matches, 1 if no matches, and 2 if trouble.\n");
}
/*************************************************
* Handle an option *
*************************************************/
static int
handle_option(int letter, int options)
{
switch(letter)
{
case -1: help(); exit(0);
case 'c': count_only = TRUE; break;
case 'h': filenames = FALSE; break;
case 'i': options |= PCRE_CASELESS; break;
case 'l': filenames_only = TRUE; break;
case 'n': number = TRUE; break;
case 'r': recurse = TRUE; break;
case 's': silent = TRUE; break;
case 'v': invert = TRUE; break;
case 'x': whole_lines = TRUE; options |= PCRE_ANCHORED; break;
case 'V':
fprintf(stderr, "pcregrep version %s using ", VERSION);
fprintf(stderr, "PCRE version %s\n", pcre_version());
exit(0);
break;
default:
fprintf(stderr, "pcregrep: Unknown option -%c\n", letter);
exit(usage(2));
}
return options;
}
/*************************************************
* Main program *
*************************************************/
@ -136,90 +377,161 @@ return rc;
int
main(int argc, char **argv)
{
int i;
int i, j;
int rc = 1;
int options = 0;
int errptr;
const char *error;
BOOL filenames = TRUE;
BOOL only_one_at_top;
/* Process the options */
for (i = 1; i < argc; i++)
{
char *s;
if (argv[i][0] != '-') break;
s = argv[i] + 1;
while (*s != 0)
/* Long name options */
if (argv[i][1] == '-')
{
switch (*s++)
option_item *op;
if (strncmp(argv[i]+2, "file=", 5) == 0)
{
case 'c': count_only = TRUE; break;
case 'h': filenames = FALSE; break;
case 'i': options |= PCRE_CASELESS; break;
case 'l': filenames_only = TRUE;
case 'n': number = TRUE; break;
case 's': silent = TRUE; break;
case 'v': invert = TRUE; break;
case 'x': whole_lines = TRUE; options |= PCRE_ANCHORED; break;
pattern_filename = argv[i] + 7;
continue;
}
case 'V':
fprintf(stderr, "PCRE version %s\n", pcre_version());
break;
for (op = optionlist; op->one_char != 0; op++)
{
if (strcmp(argv[i]+2, op->long_name) == 0)
{
options = handle_option(op->one_char, options);
break;
}
}
if (op->one_char == 0)
{
fprintf(stderr, "pcregrep: Unknown option %s\n", argv[i]);
exit(usage(2));
}
}
default:
fprintf(stderr, "pcregrep: unknown option %c\n", s[-1]);
return usage(2);
/* One-char options */
else
{
char *s = argv[i] + 1;
while (*s != 0)
{
if (*s == 'f')
{
pattern_filename = s + 1;
if (pattern_filename[0] == 0)
{
if (i >= argc - 1)
{
fprintf(stderr, "pcregrep: File name missing after -f\n");
exit(usage(2));
}
pattern_filename = argv[++i];
}
break;
}
else options = handle_option(*s++, options);
}
}
}
/* There must be at least a regexp argument */
pattern_list = malloc(MAX_PATTERN_COUNT * sizeof(pcre *));
hints_list = malloc(MAX_PATTERN_COUNT * sizeof(pcre_extra *));
if (i >= argc) return usage(0);
/* Compile the regular expression. */
pattern = pcre_compile(argv[i++], options, &error, &errptr, NULL);
if (pattern == NULL)
if (pattern_list == NULL || hints_list == NULL)
{
fprintf(stderr, "pcregrep: error in regex at offset %d: %s\n", errptr, error);
fprintf(stderr, "pcregrep: malloc failed\n");
return 2;
}
/* Study the regular expression, as we will be running it many times */
/* Compile the regular expression(s). */
hints = pcre_study(pattern, 0, &error);
if (error != NULL)
if (pattern_filename != NULL)
{
fprintf(stderr, "pcregrep: error while studing regex: %s\n", error);
return 2;
FILE *f = fopen(pattern_filename, "r");
char buffer[BUFSIZ];
if (f == NULL)
{
fprintf(stderr, "pcregrep: Failed to open %s: %s\n", pattern_filename,
strerror(errno));
return 2;
}
while (fgets(buffer, sizeof(buffer), f) != NULL)
{
char *s = buffer + (int)strlen(buffer);
if (pattern_count >= MAX_PATTERN_COUNT)
{
fprintf(stderr, "pcregrep: Too many patterns in file (max %d)\n",
MAX_PATTERN_COUNT);
return 2;
}
while (s > buffer && isspace((unsigned char)(s[-1]))) s--;
if (s == buffer) continue;
*s = 0;
pattern_list[pattern_count] = pcre_compile(buffer, options, &error,
&errptr, NULL);
if (pattern_list[pattern_count++] == NULL)
{
fprintf(stderr, "pcregrep: Error in regex number %d at offset %d: %s\n",
pattern_count, errptr, error);
return 2;
}
}
fclose(f);
}
/* If no file name, a single regex must be given inline */
else
{
if (i >= argc) return usage(0);
pattern_list[0] = pcre_compile(argv[i++], options, &error, &errptr, NULL);
if (pattern_list[0] == NULL)
{
fprintf(stderr, "pcregrep: Error in regex at offset %d: %s\n", errptr,
error);
return 2;
}
pattern_count++;
}
/* Study the regular expressions, as we will be running them many times */
for (j = 0; j < pattern_count; j++)
{
hints_list[j] = pcre_study(pattern_list[j], 0, &error);
if (error != NULL)
{
char s[16];
if (pattern_count == 1) s[0] = 0; else sprintf(s, " number %d", j);
fprintf(stderr, "pcregrep: Error while studying regex%s: %s\n", s, error);
return 2;
}
}
/* If there are no further arguments, do the business on stdin and exit */
if (i >= argc) return pcregrep(stdin, NULL);
/* Otherwise, work through the remaining arguments as files. If there is only
one, don't give its name on the output. */
/* Otherwise, work through the remaining arguments as files or directories.
Pass in the fact that there is only one argument at top level - this suppresses
the file name if the argument is not a directory. */
if (i == argc - 1) filenames = FALSE;
only_one_at_top = (i == argc - 1);
if (filenames_only) filenames = TRUE;
for (; i < argc; i++)
{
FILE *in = fopen(argv[i], "r");
if (in == NULL)
{
fprintf(stderr, "%s: failed to open: %s\n", argv[i], strerror(errno));
rc = 2;
}
else
{
int frc = pcregrep(in, filenames? argv[i] : NULL);
if (frc == 0 && rc == 1) rc = 0;
fclose(in);
}
int frc = grep_or_recurse(argv[i], recurse, filenames, only_one_at_top);
if (frc == 0 && rc == 1) rc = 0;
}
return rc;
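Note on the `-f` handling above: each line read from the pattern file has its trailing white space (including the newline left by `fgets`) removed before compilation, and a line that ends up empty is skipped. A minimal sketch of that trimming step, with a hypothetical helper name `trim_trailing` that is not part of pcregrep:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Strip trailing white space in place and return the new length.
   A return value of 0 means the line was blank and should be skipped. */
static size_t trim_trailing(char *line)
{
    char *s;
    assert(line != NULL);
    s = line + strlen(line);
    /* The cast matters: isspace() is undefined for negative char values. */
    while (s > line && isspace((unsigned char)s[-1])) s--;
    *s = 0;
    return (size_t)(s - line);
}
```

For the input `"foo.*bar \t\n"` the buffer becomes `"foo.*bar"` and the function returns 8; for `" \n"` it returns 0.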

View File

@ -12,7 +12,7 @@ functions.
Written by: Philip Hazel <ph10@cam.ac.uk>
Copyright (c) 1997-2000 University of Cambridge
Copyright (c) 1997-2001 University of Cambridge
-----------------------------------------------------------------------------
Permission is granted to anyone to use this software for any purpose on any
@ -62,13 +62,13 @@ static int eint[] = {
REG_BADRPT, /* "operand of unlimited repeat could match the empty string" */
REG_ASSERT, /* "internal error: unexpected repeat" */
REG_BADPAT, /* "unrecognized character after (?" */
REG_ESIZE, /* "too many capturing parenthesized sub-patterns" */
REG_ASSERT, /* "unused error" */
REG_EPAREN, /* "missing )" */
REG_ESUBREG, /* "back reference to non-existent subpattern" */
REG_INVARG, /* "erroffset passed as NULL" */
REG_INVARG, /* "unknown option bit(s) set" */
REG_EPAREN, /* "missing ) after comment" */
REG_ESIZE, /* "too many sets of parentheses" */
REG_ESIZE, /* "parentheses nested too deeply" */
REG_ESIZE, /* "regular expression too large" */
REG_ESPACE, /* "failed to get memory" */
REG_EPAREN, /* "unmatched brackets" */

View File

@ -2,7 +2,7 @@
* Perl-Compatible Regular Expressions *
*************************************************/
/* Copyright (c) 1997-2000 University of Cambridge */
/* Copyright (c) 1997-2001 University of Cambridge */
#ifndef _PCREPOSIX_H
#define _PCREPOSIX_H

View File

@ -73,13 +73,14 @@ for (i = 0; i < sizeof(utf8_table1)/sizeof(int); i++)
if (cvalue <= utf8_table1[i]) break;
if (i >= sizeof(utf8_table1)/sizeof(int)) return 0;
if (cvalue < 0) return -1;
*buffer++ = utf8_table2[i] | (cvalue & utf8_table3[i]);
cvalue >>= 6 - i;
for (j = 0; j < i; j++)
{
*buffer++ = 0x80 | (cvalue & 0x3f);
cvalue >>= 6;
}
buffer += i;
for (j = i; j > 0; j--)
{
*buffer-- = 0x80 | (cvalue & 0x3f);
cvalue >>= 6;
}
*buffer = utf8_table2[i] | cvalue;
return i + 1;
}
@ -117,15 +118,15 @@ if (i == 0 || i == 6) return 0; /* invalid UTF-8 */
/* i now has a value in the range 1-5 */
d = c & utf8_table3[i];
s = 6 - i;
s = 6*i;
d = (c & utf8_table3[i]) << s;
for (j = 0; j < i; j++)
{
c = *buffer++;
if ((c & 0xc0) != 0x80) return -(j+1);
s -= 6;
d |= (c & 0x3f) << s;
s += 6;
}
/* Check that encoding was the correct unique one */
@ -159,7 +160,7 @@ static const char *OP_names[] = {
"class", "Ref", "Recurse",
"Alt", "Ket", "KetRmax", "KetRmin", "Assert", "Assert not",
"AssertB", "AssertB not", "Reverse", "Once", "Cond", "Cref",
"Brazero", "Braminzero", "Bra"
"Brazero", "Braminzero", "Branumber", "Bra"
};
@ -178,7 +179,10 @@ for(;;)
if (*code >= OP_BRA)
{
fprintf(outfile, "%3d Bra %d", (code[1] << 8) + code[2], *code - OP_BRA);
if (*code - OP_BRA > EXTRACT_BASIC_MAX)
fprintf(outfile, "%3d Bra extra", (code[1] << 8) + code[2]);
else
fprintf(outfile, "%3d Bra %d", (code[1] << 8) + code[2], *code - OP_BRA);
code += 2;
}
@ -194,16 +198,6 @@ for(;;)
code++;
break;
case OP_COND:
fprintf(outfile, "%3d Cond", (code[1] << 8) + code[2]);
code += 2;
break;
case OP_CREF:
fprintf(outfile, " %.2d %s", code[1], OP_names[*code]);
code++;
break;
case OP_CHARS:
charlength = *(++code);
fprintf(outfile, "%3d ", charlength);
@ -221,11 +215,10 @@ for(;;)
case OP_ASSERTBACK:
case OP_ASSERTBACK_NOT:
case OP_ONCE:
fprintf(outfile, "%3d %s", (code[1] << 8) + code[2], OP_names[*code]);
code += 2;
break;
case OP_COND:
case OP_BRANUMBER:
case OP_REVERSE:
case OP_CREF:
fprintf(outfile, "%3d %s", (code[1] << 8) + code[2], OP_names[*code]);
code += 2;
break;
@ -298,8 +291,8 @@ for(;;)
break;
case OP_REF:
fprintf(outfile, " \\%d", *(++code));
code++;
fprintf(outfile, " \\%d", (code[1] << 8) | code[2]);
code += 3;
goto CLASS_REF_REPEAT;
case OP_CLASS:
@ -441,7 +434,12 @@ int op = 1;
int timeit = 0;
int showinfo = 0;
int showstore = 0;
int size_offsets = 45;
int size_offsets_max;
int *offsets;
#if !defined NOPOSIX
int posix = 0;
#endif
int debug = 0;
int done = 0;
unsigned char buffer[30000];
@ -455,27 +453,51 @@ outfile = stdout;
while (argc > 1 && argv[op][0] == '-')
{
char *endptr;
if (strcmp(argv[op], "-s") == 0 || strcmp(argv[op], "-m") == 0)
showstore = 1;
else if (strcmp(argv[op], "-t") == 0) timeit = 1;
else if (strcmp(argv[op], "-i") == 0) showinfo = 1;
else if (strcmp(argv[op], "-d") == 0) showinfo = debug = 1;
else if (strcmp(argv[op], "-o") == 0 && argc > 2 &&
((size_offsets = (int)strtoul(argv[op+1], &endptr, 10)), *endptr == 0))
{
op++;
argc--;
}
#if !defined NOPOSIX
else if (strcmp(argv[op], "-p") == 0) posix = 1;
#endif
else
{
printf("*** Unknown option %s\n", argv[op]);
printf("Usage: pcretest [-d] [-i] [-p] [-s] [-t] [<input> [<output>]]\n");
printf(" -d debug: show compiled code; implies -i\n"
" -i show information about compiled pattern\n"
" -p use POSIX interface\n"
" -s output store information\n"
" -t time compilation and execution\n");
printf("** Unknown or malformed option %s\n", argv[op]);
printf("Usage: pcretest [-d] [-i] [-o <n>] [-p] [-s] [-t] [<input> [<output>]]\n");
printf(" -d debug: show compiled code; implies -i\n"
" -i show information about compiled pattern\n"
" -o <n> set size of offsets vector to <n>\n");
#if !defined NOPOSIX
printf(" -p use POSIX interface\n");
#endif
printf(" -s output store information\n"
" -t time compilation and execution\n");
return 1;
}
op++;
argc--;
}
/* Get the store for the offsets vector, and remember what it was */
size_offsets_max = size_offsets;
offsets = malloc(size_offsets_max * sizeof(int));
if (offsets == NULL)
{
printf("** Failed to get %d bytes of memory for offsets vector\n",
(int)(size_offsets_max * sizeof(int)));
return 1;
}
/* Sort out the input and output files */
if (argc > 1)
@ -520,7 +542,7 @@ while (!done)
const char *error;
unsigned char *p, *pp, *ppp;
unsigned const char *tables = NULL;
const unsigned char *tables = NULL;
int do_study = 0;
int do_debug = debug;
int do_G = 0;
@ -720,13 +742,14 @@ while (!done)
if (do_showinfo)
{
unsigned long int get_options;
int old_first_char, old_options, old_count;
int count, backrefmax, first_char, need_char;
size_t size;
if (do_debug) print_internals(re);
new_info(re, NULL, PCRE_INFO_OPTIONS, &options);
new_info(re, NULL, PCRE_INFO_OPTIONS, &get_options);
new_info(re, NULL, PCRE_INFO_SIZE, &size);
new_info(re, NULL, PCRE_INFO_CAPTURECOUNT, &count);
new_info(re, NULL, PCRE_INFO_BACKREFMAX, &backrefmax);
@ -746,9 +769,9 @@ while (!done)
"First char disagreement: pcre_fullinfo=%d pcre_info=%d\n",
first_char, old_first_char);
if (old_options != options) fprintf(outfile,
"Options disagreement: pcre_fullinfo=%d pcre_info=%d\n", options,
old_options);
if (old_options != (int)get_options) fprintf(outfile,
"Options disagreement: pcre_fullinfo=%ld pcre_info=%d\n",
get_options, old_options);
}
if (size != gotten_store) fprintf(outfile,
@ -758,17 +781,17 @@ while (!done)
fprintf(outfile, "Capturing subpattern count = %d\n", count);
if (backrefmax > 0)
fprintf(outfile, "Max back reference = %d\n", backrefmax);
if (options == 0) fprintf(outfile, "No options\n");
if (get_options == 0) fprintf(outfile, "No options\n");
else fprintf(outfile, "Options:%s%s%s%s%s%s%s%s%s\n",
((options & PCRE_ANCHORED) != 0)? " anchored" : "",
((options & PCRE_CASELESS) != 0)? " caseless" : "",
((options & PCRE_EXTENDED) != 0)? " extended" : "",
((options & PCRE_MULTILINE) != 0)? " multiline" : "",
((options & PCRE_DOTALL) != 0)? " dotall" : "",
((options & PCRE_DOLLAR_ENDONLY) != 0)? " dollar_endonly" : "",
((options & PCRE_EXTRA) != 0)? " extra" : "",
((options & PCRE_UNGREEDY) != 0)? " ungreedy" : "",
((options & PCRE_UTF8) != 0)? " utf8" : "");
((get_options & PCRE_ANCHORED) != 0)? " anchored" : "",
((get_options & PCRE_CASELESS) != 0)? " caseless" : "",
((get_options & PCRE_EXTENDED) != 0)? " extended" : "",
((get_options & PCRE_MULTILINE) != 0)? " multiline" : "",
((get_options & PCRE_DOTALL) != 0)? " dotall" : "",
((get_options & PCRE_DOLLAR_ENDONLY) != 0)? " dollar_endonly" : "",
((get_options & PCRE_EXTRA) != 0)? " extra" : "",
((get_options & PCRE_UNGREEDY) != 0)? " ungreedy" : "",
((get_options & PCRE_UTF8) != 0)? " utf8" : "");
if (((((real_pcre *)re)->options) & PCRE_ICHANGED) != 0)
fprintf(outfile, "Case state changes\n");
@ -871,6 +894,8 @@ while (!done)
{
unsigned char *q;
unsigned char *bptr = dbuffer;
int *use_offsets = offsets;
int use_size_offsets = size_offsets;
int count, c;
int copystrings = 0;
int getstrings = 0;
@ -878,8 +903,6 @@ while (!done)
int gmatched = 0;
int start_offset = 0;
int g_notempty = 0;
int offsets[45];
int size_offsets = sizeof(offsets)/sizeof(int);
options = 0;
@ -987,7 +1010,20 @@ while (!done)
case 'O':
while(isdigit(*p)) n = n * 10 + *p++ - '0';
if (n <= (int)(sizeof(offsets)/sizeof(int))) size_offsets = n;
if (n > size_offsets_max)
{
size_offsets_max = n;
free(offsets);
use_offsets = offsets = malloc(size_offsets_max * sizeof(int));
if (offsets == NULL)
{
printf("** Failed to get %d bytes of memory for offsets vector\n",
(int)(size_offsets_max * sizeof(int)));
return 1;
}
}
use_size_offsets = n;
if (n == 0) use_offsets = NULL;
continue;
case 'Z':
@ -1007,11 +1043,11 @@ while (!done)
{
int rc;
int eflags = 0;
regmatch_t pmatch[sizeof(offsets)/sizeof(int)];
regmatch_t *pmatch = malloc(sizeof(regmatch_t) * use_size_offsets);
if ((options & PCRE_NOTBOL) != 0) eflags |= REG_NOTBOL;
if ((options & PCRE_NOTEOL) != 0) eflags |= REG_NOTEOL;
rc = regexec(&preg, (const char *)bptr, size_offsets, pmatch, eflags);
rc = regexec(&preg, (const char *)bptr, use_size_offsets, pmatch, eflags);
if (rc != 0)
{
@ -1021,7 +1057,7 @@ while (!done)
else
{
size_t i;
for (i = 0; i < size_offsets; i++)
for (i = 0; i < use_size_offsets; i++)
{
if (pmatch[i].rm_so >= 0)
{
@ -1038,6 +1074,7 @@ while (!done)
}
}
}
free(pmatch);
}
/* Handle matching via the native interface - repeats for /g and /G */
@ -1054,7 +1091,7 @@ while (!done)
clock_t start_time = clock();
for (i = 0; i < LOOPREPEAT; i++)
count = pcre_exec(re, extra, (char *)bptr, len,
start_offset, options | g_notempty, offsets, size_offsets);
start_offset, options | g_notempty, use_offsets, use_size_offsets);
time_taken = clock() - start_time;
fprintf(outfile, "Execute time %.3f milliseconds\n",
((double)time_taken * 1000.0)/
@ -1062,12 +1099,12 @@ while (!done)
}
count = pcre_exec(re, extra, (char *)bptr, len,
start_offset, options | g_notempty, offsets, size_offsets);
start_offset, options | g_notempty, use_offsets, use_size_offsets);
if (count == 0)
{
fprintf(outfile, "Matched, but too many substrings\n");
count = size_offsets/3;
count = use_size_offsets/3;
}
/* Matched */
@ -1077,19 +1114,19 @@ while (!done)
int i;
for (i = 0; i < count * 2; i += 2)
{
if (offsets[i] < 0)
if (use_offsets[i] < 0)
fprintf(outfile, "%2d: <unset>\n", i/2);
else
{
fprintf(outfile, "%2d: ", i/2);
pchars(bptr + offsets[i], offsets[i+1] - offsets[i], utf8);
pchars(bptr + use_offsets[i], use_offsets[i+1] - use_offsets[i], utf8);
fprintf(outfile, "\n");
if (i == 0)
{
if (do_showrest)
{
fprintf(outfile, " 0+ ");
pchars(bptr + offsets[i+1], len - offsets[i+1], utf8);
pchars(bptr + use_offsets[i+1], len - use_offsets[i+1], utf8);
fprintf(outfile, "\n");
}
}
@ -1101,7 +1138,7 @@ while (!done)
if ((copystrings & (1 << i)) != 0)
{
char copybuffer[16];
int rc = pcre_copy_substring((char *)bptr, offsets, count,
int rc = pcre_copy_substring((char *)bptr, use_offsets, count,
i, copybuffer, sizeof(copybuffer));
if (rc < 0)
fprintf(outfile, "copy substring %d failed %d\n", i, rc);
@ -1115,7 +1152,7 @@ while (!done)
if ((getstrings & (1 << i)) != 0)
{
const char *substring;
int rc = pcre_get_substring((char *)bptr, offsets, count,
int rc = pcre_get_substring((char *)bptr, use_offsets, count,
i, &substring);
if (rc < 0)
fprintf(outfile, "get substring %d failed %d\n", i, rc);
@ -1131,7 +1168,7 @@ while (!done)
if (getlist)
{
const char **stringlist;
int rc = pcre_get_substring_list((char *)bptr, offsets, count,
int rc = pcre_get_substring_list((char *)bptr, use_offsets, count,
&stringlist);
if (rc < 0)
fprintf(outfile, "get substring list failed %d\n", rc);
@ -1157,8 +1194,8 @@ while (!done)
{
if (g_notempty != 0)
{
offsets[0] = start_offset;
offsets[1] = start_offset + 1;
use_offsets[0] = start_offset;
use_offsets[1] = start_offset + 1;
}
else
{
@ -1183,22 +1220,22 @@ while (!done)
character. */
g_notempty = 0;
if (offsets[0] == offsets[1])
if (use_offsets[0] == use_offsets[1])
{
if (offsets[0] == len) break;
if (use_offsets[0] == len) break;
g_notempty = PCRE_NOTEMPTY | PCRE_ANCHORED;
}
/* For /g, update the start offset, leaving the rest alone */
if (do_g) start_offset = offsets[1];
if (do_g) start_offset = use_offsets[1];
/* For /G, update the pointer and length */
else
{
bptr += offsets[1];
len -= offsets[1];
bptr += use_offsets[1];
len -= use_offsets[1];
}
} /* End of loop for /g and /G */
} /* End of loop for data lines */
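
The hunks above replace the fixed `int offsets[45]` array with a heap vector that is reallocated whenever `\O` asks for more slots than are currently available. A minimal sketch of that grow-on-demand pattern follows, assuming a hypothetical helper `ensure_offsets` (not a pcretest function) and assuming the old contents can be discarded, as they are in the diff:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper, not part of pcretest: make sure *vec can hold at
   least `wanted` ints. *capacity is the current capacity, counted in ints.
   Returns 0 on success, or -1 if allocation fails, in which case the old
   vector and capacity are left untouched. */
int ensure_offsets(int **vec, int *capacity, int wanted)
{
  int *bigger;
  assert(vec != NULL && capacity != NULL);
  if (wanted <= *capacity) return 0;          /* already large enough */
  bigger = malloc((size_t)wanted * sizeof(int));
  if (bigger == NULL) return -1;
  free(*vec);                                 /* old contents are not needed */
  *vec = bigger;
  *capacity = wanted;
  return 0;
}
```

Unlike the diff, this sketch allocates the new block before freeing the old one, so a failed allocation does not leave the caller holding a dangling pointer. That is a design choice of the sketch, not something the commit does.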

View File

@ -9,7 +9,7 @@ the file Tech.Notes for some information on the internals.
Written by: Philip Hazel <ph10@cam.ac.uk>
Copyright (c) 1997-2000 University of Cambridge
Copyright (c) 1997-2001 University of Cambridge
-----------------------------------------------------------------------------
Permission is granted to anyone to use this software for any purpose on any
@ -104,8 +104,6 @@ do
while (try_next)
{
try_next = FALSE;
/* If a branch starts with a bracket or a positive lookahead assertion,
recurse to set bits from within them. That's all for this branch. */
@ -113,6 +111,7 @@ do
{
if (!set_start_bits(tcode, start_bits, caseless, cd))
return FALSE;
try_next = FALSE;
}
else switch(*tcode)
@ -120,12 +119,17 @@ do
default:
return FALSE;
/* Skip over extended extraction bracket number */
case OP_BRANUMBER:
tcode += 3;
break;
/* Skip over lookbehind and negative lookahead assertions */
case OP_ASSERT_NOT:
case OP_ASSERTBACK:
case OP_ASSERTBACK_NOT:
try_next = TRUE;
do tcode += (tcode[1] << 8) + tcode[2]; while (*tcode == OP_ALT);
tcode += 3;
break;
@ -135,7 +139,6 @@ do
case OP_OPT:
caseless = (tcode[1] & PCRE_CASELESS) != 0;
tcode += 2;
try_next = TRUE;
break;
/* BRAZERO does the bracket, but carries on. */
@ -147,7 +150,6 @@ do
dummy = 1;
do tcode += (tcode[1] << 8) + tcode[2]; while (*tcode == OP_ALT);
tcode += 3;
try_next = TRUE;
break;
/* Single-char * or ? sets the bit and tries the next item */
@ -158,7 +160,6 @@ do
case OP_MINQUERY:
set_bit(start_bits, tcode[1], caseless, cd);
tcode += 2;
try_next = TRUE;
break;
/* Single-char upto sets the bit and tries the next */
@ -167,7 +168,6 @@ do
case OP_MINUPTO:
set_bit(start_bits, tcode[3], caseless, cd);
tcode += 4;
try_next = TRUE;
break;
/* At least one single char sets the bit and stops */
@ -181,6 +181,7 @@ do
case OP_PLUS:
case OP_MINPLUS:
set_bit(start_bits, tcode[1], caseless, cd);
try_next = FALSE;
break;
/* Single character type sets the bits and stops */
@ -188,31 +189,37 @@ do
case OP_NOT_DIGIT:
for (c = 0; c < 32; c++)
start_bits[c] |= ~cd->cbits[c+cbit_digit];
try_next = FALSE;
break;
case OP_DIGIT:
for (c = 0; c < 32; c++)
start_bits[c] |= cd->cbits[c+cbit_digit];
try_next = FALSE;
break;
case OP_NOT_WHITESPACE:
for (c = 0; c < 32; c++)
start_bits[c] |= ~cd->cbits[c+cbit_space];
try_next = FALSE;
break;
case OP_WHITESPACE:
for (c = 0; c < 32; c++)
start_bits[c] |= cd->cbits[c+cbit_space];
try_next = FALSE;
break;
case OP_NOT_WORDCHAR:
for (c = 0; c < 32; c++)
start_bits[c] |= ~cd->cbits[c+cbit_word];
try_next = FALSE;
break;
case OP_WORDCHAR:
for (c = 0; c < 32; c++)
start_bits[c] |= cd->cbits[c+cbit_word];
try_next = FALSE;
break;
/* One or more character type fudges the pointer and restarts, knowing
@ -221,12 +228,10 @@ do
case OP_TYPEPLUS:
case OP_TYPEMINPLUS:
tcode++;
try_next = TRUE;
break;
case OP_TYPEEXACT:
tcode += 3;
try_next = TRUE;
break;
/* Zero or more repeats of character types set the bits and then
@ -274,7 +279,6 @@ do
}
tcode += 2;
try_next = TRUE;
break;
/* Character class: set the bits and either carry on or not,
@ -292,16 +296,16 @@ do
case OP_CRQUERY:
case OP_CRMINQUERY:
tcode++;
try_next = TRUE;
break;
case OP_CRRANGE:
case OP_CRMINRANGE:
if (((tcode[1] << 8) + tcode[2]) == 0)
{
tcode += 5;
try_next = TRUE;
}
if (((tcode[1] << 8) + tcode[2]) == 0) tcode += 5;
else try_next = FALSE;
break;
default:
try_next = FALSE;
break;
}
}
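
The `start_bits` table that these hunks fill in is looped over with `c < 32`, which is consistent with a 256-bit map (one bit per possible first byte) packed into 32 bytes. The sketch below illustrates that kind of bitmap. The function names `map_set`, `map_test` and `map_or` are hypothetical, and the bit order within each byte is an assumption rather than something this diff shows:

```c
#include <assert.h>

/* A 256-bit map stored in 32 bytes: bit (c % 8) of byte (c / 8) stands for
   byte value c. The bit order is an assumption made for this sketch. */

/* Mark byte value c as a possible first character. */
void map_set(unsigned char *map, unsigned int c)
{
  assert(c < 256);
  map[c / 8] |= (unsigned char)(1u << (c % 8));
}

/* Return 1 if byte value c is marked, 0 otherwise. */
int map_test(const unsigned char *map, unsigned int c)
{
  assert(c < 256);
  return (map[c / 8] >> (c % 8)) & 1;
}

/* OR a whole 32-byte class bitmap into the map, in the same way the
   OP_DIGIT case ORs in the digit class bits. */
void map_or(unsigned char *map, const unsigned char *cls)
{
  int i;
  for (i = 0; i < 32; i++) map[i] |= cls[i];
}
```

With this layout, a character-type opcode costs 32 byte-wide OR operations regardless of how many characters are in the class.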

View File

@ -1442,10 +1442,6 @@
ABCabc
abcABC
/(main(O)?)+/
mainmain
mainOmain
/ab{3cd/
ab{3cd
@ -1919,4 +1915,37 @@
acb
a\nb
/^(b+?|a){1,2}?c/
bac
bbac
bbbac
bbbbac
bbbbbac
/^(b+|a){1,2}?c/
bac
bbac
bbbac
bbbbac
bbbbbac
/(?!\A)x/m
x\nb\n
a\bx\n
/\x0{ab}/
\0{ab}
/(A|B)*?CD/
CD
/(A|B)*CD/
CD
/(AB)*?\1/
ABABAB
/(AB)*\1/
ABABAB
/ End of testinput1 /

View File

@ -709,4 +709,15 @@
/^(?(0)f|b)oo/
/This one's here because of the large output vector needed/
/(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|
$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\w+)\s+(\270)/
\O900 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 ABC ABC
/This one's here because Perl does this differently and PCRE can't at present/
/(main(O)?)+/
mainmain
mainOmain
/ End of testinput2 /

View File

@ -27,6 +27,32 @@
/\xff/8D
/\x{0041}\x{2262}\x{0391}\x{002e}/D8
\x{0041}\x{2262}\x{0391}\x{002e}
/\x{D55c}\x{ad6d}\x{C5B4}/D8
\x{D55c}\x{ad6d}\x{C5B4}
/\x{65e5}\x{672c}\x{8a9e}/D8
\x{65e5}\x{672c}\x{8a9e}
/\x{80}/D8
/\x{084}/D8
/\x{104}/D8
/\x{861}/D8
/\x{212ab}/D8
/.{3,5}X/D8
\x{212ab}\x{212ab}\x{212ab}\x{861}X
/.{3,5}?/D8
\x{212ab}\x{212ab}\x{212ab}\x{861}
/-- These tests are here rather than in testinput5 because Perl 5.6 has --/
/-- some problems with UTF-8 support, in the area of \x{..} where the --/
/-- value is < 255. It grumbles about invalid UTF-8 strings. --/
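
The `\x{...}` patterns added above exercise the UTF-8 encoder whose byte order was fixed in this release. As a rough illustration of the expected packing, here is a sketch of an encoder for the original 1 to 6 byte UTF-8 scheme. The function name `utf8_encode` is hypothetical and this is not PCRE's own code:

```c
#include <assert.h>

/* Encode code point cp (0..0x7fffffff) using the original 1-6 byte UTF-8
   scheme. Writes into buf, which must have room for 6 bytes, and returns
   the number of bytes written. */
int utf8_encode(unsigned long cp, unsigned char *buf)
{
  /* Largest code point representable with 0, 1, ... 5 continuation bytes. */
  static const unsigned long limit[] =
    { 0x7fUL, 0x7ffUL, 0xffffUL, 0x1fffffUL, 0x3ffffffUL, 0x7fffffffUL };
  /* Marker bits of the lead byte for each sequence length. */
  static const unsigned char lead[] =
    { 0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc };
  int extra = 0, i;

  assert(cp <= 0x7fffffffUL);
  while (cp > limit[extra]) extra++;    /* count continuation bytes */

  /* Fill continuation bytes from the end: the lowest 6 bits go last. */
  for (i = extra; i > 0; i--)
  {
    buf[i] = (unsigned char)(0x80 | (cp & 0x3f));
    cp >>= 6;
  }
  buf[0] = (unsigned char)(lead[extra] | cp);   /* highest bits go first */
  return extra + 1;
}
```

Under this scheme U+0100 becomes the two bytes C4 80 and U+212AB becomes F0 A1 8A AB: the most significant bits are carried by the lead byte and the least significant six bits by the final byte.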

View File

@ -1,4 +1,4 @@
PCRE version 3.4 22-Aug-2000
PCRE version 3.9 02-Jan-2002
/the quick brown fox/
the quick brown fox
@ -2080,15 +2080,6 @@ No match
0: abcABC
1: abc
/(main(O)?)+/
mainmain
0: mainmain
1: main
mainOmain
0: mainOmain
1: main
2: O
/ab{3cd/
ab{3cd
0: ab{3cd
@ -2962,5 +2953,67 @@ No match
a\nb
0: a\x0ab
/^(b+?|a){1,2}?c/
bac
0: bac
1: a
bbac
0: bbac
1: a
bbbac
0: bbbac
1: a
bbbbac
0: bbbbac
1: a
bbbbbac
0: bbbbbac
1: a
/^(b+|a){1,2}?c/
bac
0: bac
1: a
bbac
0: bbac
1: a
bbbac
0: bbbac
1: a
bbbbac
0: bbbbac
1: a
bbbbbac
0: bbbbbac
1: a
/(?!\A)x/m
x\nb\n
No match
a\bx\n
0: x
/\x0{ab}/
\0{ab}
0: \x00{ab}
/(A|B)*?CD/
CD
0: CD
/(A|B)*CD/
CD
0: CD
/(AB)*?\1/
ABABAB
0: ABAB
1: AB
/(AB)*\1/
ABABAB
0: ABABAB
1: AB
/ End of testinput1 /

View File

@ -1,4 +1,4 @@
PCRE version 3.4 22-Aug-2000
PCRE version 3.9 02-Jan-2002
/(a)b|/
Capturing subpattern count = 1
@ -2067,6 +2067,311 @@ Failed: range out of order in character class at offset 9
/^(?(0)f|b)oo/
Failed: invalid condition (?(0) at offset 5
/This one's here because of the large output vector needed/
Capturing subpattern count = 0
No options
First char = 'T'
Need char = 'd'
/(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|
$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\d+(?:\s|$))(\w+)\s+(\270)/
Capturing subpattern count = 271
Max back reference = 270
No options
No first char
No need char
\O900 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 ABC ABC
0: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 ABC ABC
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
8: 8
9: 9
10: 10
11: 11
12: 12
13: 13
14: 14
15: 15
16: 16
17: 17
18: 18
19: 19
20: 20
21: 21
22: 22
23: 23
24: 24
25: 25
26: 26
27: 27
28: 28
29: 29
30: 30
31: 31
32: 32
33: 33
34: 34
35: 35
36: 36
37: 37
38: 38
39: 39
40: 40
41: 41
42: 42
43: 43
44: 44
45: 45
46: 46
47: 47
48: 48
49: 49
50: 50
51: 51
52: 52
53: 53
54: 54
55: 55
56: 56
57: 57
58: 58
59: 59
60: 60
61: 61
62: 62
63: 63
64: 64
65: 65
66: 66
67: 67
68: 68
69: 69
70: 70
71: 71
72: 72
73: 73
74: 74
75: 75
76: 76
77: 77
78: 78
79: 79
80: 80
81: 81
82: 82
83: 83
84: 84
85: 85
86: 86
87: 87
88: 88
89: 89
90: 90
91: 91
92: 92
93: 93
94: 94
95: 95
96: 96
97: 97
98: 98
99: 99
100: 100
101: 101
102: 102
103: 103
104: 104
105: 105
106: 106
107: 107
108: 108
109: 109
110: 110
111: 111
112: 112
113: 113
114: 114
115: 115
116: 116
117: 117
118: 118
119: 119
120: 120
121: 121
122: 122
123: 123
124: 124
125: 125
126: 126
127: 127
128: 128
129: 129
130: 130
131: 131
132: 132
133: 133
134: 134
135: 135
136: 136
137: 137
138: 138
139: 139
140: 140
141: 141
142: 142
143: 143
144: 144
145: 145
146: 146
147: 147
148: 148
149: 149
150: 150
151: 151
152: 152
153: 153
154: 154
155: 155
156: 156
157: 157
158: 158
159: 159
160: 160
161: 161
162: 162
163: 163
164: 164
165: 165
166: 166
167: 167
168: 168
169: 169
170: 170
171: 171
172: 172
173: 173
174: 174
175: 175
176: 176
177: 177
178: 178
179: 179
180: 180
181: 181
182: 182
183: 183
184: 184
185: 185
186: 186
187: 187
188: 188
189: 189
190: 190
191: 191
192: 192
193: 193
194: 194
195: 195
196: 196
197: 197
198: 198
199: 199
200: 200
201: 201
202: 202
203: 203
204: 204
205: 205
206: 206
207: 207
208: 208
209: 209
210: 210
211: 211
212: 212
213: 213
214: 214
215: 215
216: 216
217: 217
218: 218
219: 219
220: 220
221: 221
222: 222
223: 223
224: 224
225: 225
226: 226
227: 227
228: 228
229: 229
230: 230
231: 231
232: 232
233: 233
234: 234
235: 235
236: 236
237: 237
238: 238
239: 239
240: 240
241: 241
242: 242
243: 243
244: 244
245: 245
246: 246
247: 247
248: 248
249: 249
250: 250
251: 251
252: 252
253: 253
254: 254
255: 255
256: 256
257: 257
258: 258
259: 259
260: 260
261: 261
262: 262
263: 263
264: 264
265: 265
266: 266
267: 267
268: 268
269: 269
270: ABC
271: ABC
/This one's here because Perl does this differently and PCRE can't at present/
Capturing subpattern count = 0
No options
First char = 'T'
Need char = 't'
/(main(O)?)+/
Capturing subpattern count = 2
No options
First char = 'm'
Need char = 'n'
mainmain
0: mainmain
1: main
mainOmain
0: mainOmain
1: main
2: O
/ End of testinput2 /
Capturing subpattern count = 0
No options

View File

@ -1,4 +1,4 @@
PCRE version 3.4 22-Aug-2000
PCRE version 3.9 02-Jan-2002
/(?<!bar)foo/
foo

View File

@ -1,4 +1,4 @@
PCRE version 3.4 22-Aug-2000
PCRE version 3.9 02-Jan-2002
/^[\w]+/
*** Failers

View File

@ -1,4 +1,4 @@
PCRE version 3.4 22-Aug-2000
PCRE version 3.9 02-Jan-2002
/-- Because of problems with Perl 5.6 in handling UTF-8 vs non UTF-8 --/
/-- strings automatically, do not use the \x{} construct except with --/

View File

@ -1,82 +1,82 @@
PCRE version 3.4 22-Aug-2000
PCRE version 3.9 02-Jan-2002
/\x{100}/8DM
Memory allocation (code space): 11
------------------------------------------------------------------
0 7 Bra 0
3 2 \xc0\x88
3 2 \xc4\x80
7 7 Ket
10 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 192
Need char = 136
First char = 196
Need char = 128
/\x{1000}/8DM
Memory allocation (code space): 12
------------------------------------------------------------------
0 8 Bra 0
3 3 \xe0\x80\x84
3 3 \xe1\x80\x80
8 8 Ket
11 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 224
Need char = 132
First char = 225
Need char = 128
/\x{10000}/8DM
Memory allocation (code space): 13
------------------------------------------------------------------
0 9 Bra 0
3 4 \xf0\x80\x80\x82
3 4 \xf0\x90\x80\x80
9 9 Ket
12 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 240
Need char = 130
Need char = 128
/\x{100000}/8DM
Memory allocation (code space): 13
------------------------------------------------------------------
0 9 Bra 0
3 4 \xf0\x80\x80\xa0
3 4 \xf4\x80\x80\x80
9 9 Ket
12 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 240
Need char = 160
First char = 244
Need char = 128
/\x{1000000}/8DM
Memory allocation (code space): 14
------------------------------------------------------------------
0 10 Bra 0
3 5 \xf8\x80\x80\x80\x90
3 5 \xf9\x80\x80\x80\x80
10 10 Ket
13 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 248
Need char = 144
First char = 249
Need char = 128
/\x{4000000}/8DM
Memory allocation (code space): 15
------------------------------------------------------------------
0 11 Bra 0
3 6 \xfc\x80\x80\x80\x80\x82
3 6 \xfc\x84\x80\x80\x80\x80
11 11 Ket
14 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 252
Need char = 130
Need char = 128
/\x{7fffFFFF}/8DM
Memory allocation (code space): 15
@ -121,26 +121,160 @@ Failed: character value in \x{...} sequence is too large at offset 12
/\x80/8D
------------------------------------------------------------------
0 7 Bra 0
3 2 \xc0\x84
3 2 \xc2\x80
7 7 Ket
10 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 192
Need char = 132
First char = 194
Need char = 128
/\xff/8D
------------------------------------------------------------------
0 7 Bra 0
3 2 \xdf\x87
3 2 \xc3\xbf
7 7 Ket
10 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 223
Need char = 135
First char = 195
Need char = 191
/\x{0041}\x{2262}\x{0391}\x{002e}/D8
------------------------------------------------------------------
0 12 Bra 0
3 7 A\xe2\x89\xa2\xce\x91.
12 12 Ket
15 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 'A'
Need char = '.'
\x{0041}\x{2262}\x{0391}\x{002e}
0: A\x{2262}\x{391}.
/\x{D55c}\x{ad6d}\x{C5B4}/D8
------------------------------------------------------------------
0 14 Bra 0
3 9 \xed\x95\x9c\xea\xb5\xad\xec\x96\xb4
14 14 Ket
17 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 237
Need char = 180
\x{D55c}\x{ad6d}\x{C5B4}
0: \x{d55c}\x{ad6d}\x{c5b4}
/\x{65e5}\x{672c}\x{8a9e}/D8
------------------------------------------------------------------
0 14 Bra 0
3 9 \xe6\x97\xa5\xe6\x9c\xac\xe8\xaa\x9e
14 14 Ket
17 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 230
Need char = 158
\x{65e5}\x{672c}\x{8a9e}
0: \x{65e5}\x{672c}\x{8a9e}
/\x{80}/D8
------------------------------------------------------------------
0 7 Bra 0
3 2 \xc2\x80
7 7 Ket
10 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 194
Need char = 128
/\x{084}/D8
------------------------------------------------------------------
0 7 Bra 0
3 2 \xc2\x84
7 7 Ket
10 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 194
Need char = 132
/\x{104}/D8
------------------------------------------------------------------
0 7 Bra 0
3 2 \xc4\x84
7 7 Ket
10 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 196
Need char = 132
/\x{861}/D8
------------------------------------------------------------------
0 8 Bra 0
3 3 \xe0\xa1\xa1
8 8 Ket
11 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 224
Need char = 161
/\x{212ab}/D8
------------------------------------------------------------------
0 9 Bra 0
3 4 \xf0\xa1\x8a\xab
9 9 Ket
12 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
First char = 240
Need char = 171
/.{3,5}X/D8
------------------------------------------------------------------
0 14 Bra 0
3 Any{3}
7 Any{0,2}
11 1 X
14 14 Ket
17 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
No first char
Need char = 'X'
\x{212ab}\x{212ab}\x{212ab}\x{861}X
0: \x{212ab}\x{212ab}\x{212ab}\x{861}X
/.{3,5}?/D8
------------------------------------------------------------------
0 11 Bra 0
3 Any{3}
7 Any{0,2}?
11 11 Ket
14 End
------------------------------------------------------------------
Capturing subpattern count = 0
Options: utf8
No first char
No need char
\x{212ab}\x{212ab}\x{212ab}\x{861}
0: \x{212ab}\x{212ab}\x{212ab}
/-- These tests are here rather than in testinput5 because Perl 5.6 has --/
/-- some problems with UTF-8 support, in the area of \x{..} where the --/