upgrade to pcre 7.7

Nuno Lopes 2008-07-06 15:18:00 +00:00
parent 453e502236
commit 8a5db93312
21 changed files with 2011 additions and 1093 deletions


@ -1,6 +1,114 @@
ChangeLog for PCRE
------------------
Version 7.7 07-May-08
---------------------
1. Applied Craig's patch to sort out a long long problem: "If we can't convert
a string to a long long, pretend we don't even have a long long." This is
done by checking for the strtoq, strtoll, and _strtoi64 functions.
2. Applied Craig's patch to pcrecpp.cc to restore ABI compatibility with
pre-7.6 versions, which defined a global no_arg variable instead of putting
it in the RE class. (See also #8 below.)
3. Remove a line of dead code, identified by coverity and reported by Nuno
Lopes.
4. Fixed two related pcregrep bugs involving -r with --include or --exclude:
(1) The include/exclude patterns were being applied to the whole pathnames
of files, instead of just to the final components.
(2) If there was more than one level of directory, the subdirectories were
skipped unless they satisfied the include/exclude conditions. This is
inconsistent with GNU grep (and could even be seen as contrary to the
pcregrep specification - which I improved to make it absolutely clear).
The action now is always to scan all levels of directory, and just
apply the include/exclude patterns to regular files.
5. Added the --include_dir and --exclude_dir patterns to pcregrep, and used
--exclude_dir in the tests to avoid scanning .svn directories.
6. Applied Craig's patch to the QuoteMeta function so that it escapes the
NUL character as backslash + 0 rather than backslash + NUL, because PCRE
doesn't support NULs in patterns.
7. Added some missing "const"s to declarations of static tables in
pcre_compile.c and pcre_dfa_exec.c.
8. Applied Craig's patch to pcrecpp.cc to fix a problem in OS X that was
caused by fix #2 above. (Subsequently also a second patch to fix the
first patch. And a third patch - this was a messy problem.)
9. Applied Craig's patch to remove the use of push_back().
10. Applied Alan Lehotsky's patch to add REG_STARTEND support to the POSIX
matching function regexec().
11. Added support for the Oniguruma syntax \g<name>, \g<n>, \g'name', \g'n',
which, however, unlike Perl's \g{...}, are subroutine calls, not back
references. PCRE supports relative numbers with this syntax (I don't think
Oniguruma does).
12. Previously, a group with a zero repeat such as (...){0} was completely
omitted from the compiled regex. However, this means that if the group
was called as a subroutine from elsewhere in the pattern, things went wrong
(an internal error was given). Such groups are now left in the compiled
pattern, with a new opcode that causes them to be skipped at execution
time.
13. Added the PCRE_JAVASCRIPT_COMPAT option. This makes the following changes
to the way PCRE behaves:
(a) A lone ] character is dis-allowed (Perl treats it as data).
(b) A back reference to an unmatched subpattern matches an empty string
(Perl fails the current match path).
(c) A data ] in a character class must be notated as \] because if the
first data character in a class is ], it defines an empty class. (In
Perl it is not possible to have an empty class.) The empty class []
never matches; it forces failure and is equivalent to (*FAIL) or (?!).
The negative empty class [^] matches any one character, independently
of the DOTALL setting.
14. A pattern such as /(?2)[]a()b](abc)/ which had a forward reference to a
non-existent subpattern following a character class starting with ']' and
containing () gave an internal compiling error instead of "reference to
non-existent subpattern". Fortunately, when the pattern did exist, the
compiled code was correct. (When scanning forwards to check for the
existence of the subpattern, it was treating the data ']' as terminating
the class, so got the count wrong. When actually compiling, the reference
was subsequently set up correctly.)
15. The "always fail" assertion (?!) is optimized to (*FAIL) by pcre_compile;
it was being rejected as not supported by pcre_dfa_exec(), even though
other assertions are supported. I have made pcre_dfa_exec() support
(*FAIL).
16. The implementation of 13c above involved the invention of a new opcode,
OP_ALLANY, which is like OP_ANY but doesn't check the /s flag. Since /s
cannot be changed at match time, I realized I could make a small
improvement to matching performance by compiling OP_ALLANY instead of
OP_ANY for "." when DOTALL was set, and then removing the runtime tests
on the OP_ANY path.
17. Compiling pcretest on Windows with readline support failed without the
following two fixes: (1) Make the unistd.h include conditional on
HAVE_UNISTD_H; (2) #define isatty and fileno as _isatty and _fileno.
18. Changed CMakeLists.txt and cmake/FindReadline.cmake to arrange for the
ncurses library to be included for pcretest when ReadLine support is
requested, but also to allow for it to be overridden. This patch came from
Daniel Bergström.
19. There was a typo in the file ucpinternal.h where f0_rangeflag was defined
as 0x00f00000 instead of 0x00800000. Luckily, this would not have caused
any errors with the current Unicode tables. Thanks to Peter Kankowski for
spotting this.
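
Item 6 above can be illustrated with a small sketch. This is not the pcrecpp QuoteMeta code: the function name quote_meta_sketch and its buffer-based signature are hypothetical, and the rule used here (pass through ASCII letters, digits, underscore and bytes with the high bit set; backslash-escape everything else) is an assumption made for the illustration. The only behaviour taken from the ChangeLog entry is that a NUL byte comes out as backslash + '0' rather than backslash + NUL.

```c
#include <stddef.h>

/* Hypothetical helper, not the pcrecpp API. Writes the quoted form of
   src[0..len) into dst, which must hold at least 2*len + 1 bytes.
   Returns the number of bytes written, excluding the terminating NUL. */
size_t quote_meta_sketch(const char *src, size_t len, char *dst)
{
  size_t out = 0;
  for (size_t i = 0; i < len; i++)
    {
    unsigned char c = (unsigned char)src[i];
    /* Assumed pass-through set: letters, digits, '_' and high-bit bytes. */
    int is_plain = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                   (c >= '0' && c <= '9') || c == '_' || (c & 0x80) != 0;
    if (c == 0)
      {
      /* Backslash + '0', because a raw NUL cannot appear in a pattern. */
      dst[out++] = '\\';
      dst[out++] = '0';
      }
    else if (is_plain)
      dst[out++] = (char)c;
    else
      {
      dst[out++] = '\\';
      dst[out++] = (char)c;
      }
    }
  dst[out] = '\0';
  return out;
}
```

Called on the four bytes 'a', NUL, '.', 'b', the sketch produces the six characters a \ 0 \ . b, so the result is an ordinary C string that can be passed on as a pattern.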
Version 7.6 28-Jan-08
---------------------


@ -125,7 +125,8 @@ Opcodes with no following data
These items are all just one byte long
OP_END end of pattern
OP_ANY match any character
OP_ANY match any one character other than newline
OP_ALLANY match any one character, including newline
OP_ANYBYTE match any single byte, even in UTF-8 mode
OP_SOD match start of data: \A
OP_SOM, start of match (subject + offset): \G
@ -318,9 +319,12 @@ maximally respectively. All three are followed by LINK_SIZE bytes giving (as a
positive number) the offset back to the matching bracket opcode.
If a subpattern is quantified such that it is permitted to match zero times, it
is preceded by one of OP_BRAZERO or OP_BRAMINZERO. These are single-byte
opcodes which tell the matcher that skipping this subpattern entirely is a
valid branch.
is preceded by one of OP_BRAZERO, OP_BRAMINZERO, or OP_SKIPZERO. These are
single-byte opcodes that tell the matcher that skipping the following
subpattern entirely is a valid branch. In the case of the first two, not
skipping the pattern is also valid (greedy and non-greedy). The third is used
when a pattern has the quantifier {0,0}. It cannot be entirely discarded,
because it may be called as a subroutine from elsewhere in the regex.
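
As a rough illustration of the skipping described above, here is a toy model with made-up opcodes and a one-byte link size. The T_* names, TOY_LINK_SIZE and toy_skip_group are all hypothetical; real compiled patterns use LINK_SIZE-byte links and many more opcodes. The sketch only shows the idea of following the forward links from the opening bracket through each alternative to the closing ket.

```c
#include <stddef.h>

/* Toy opcodes for illustration only; these are not PCRE's values. */
enum { T_END, T_CHAR, T_BRA, T_ALT, T_KET, T_SKIPZERO };
#define TOY_LINK_SIZE 1

/* code[pos] is assumed to be T_SKIPZERO, followed by a bracket group.
   Returns the index of the first opcode after the whole group. */
size_t toy_skip_group(const unsigned char *code, size_t pos)
{
  size_t next = pos + 1;              /* The T_BRA after T_SKIPZERO */
  do next += code[next + 1];          /* Follow the link to T_ALT or T_KET */
  while (code[next] == T_ALT);
  return next + 1 + TOY_LINK_SIZE;    /* Step over T_KET and its link */
}
```

For a toy encoding of (a|b){0}c laid out as SKIPZERO, BRA+link, CHAR a, ALT+link, CHAR b, KET+link, CHAR c, END, the function lands on the CHAR opcode for 'c'. The group's code is still present, so a subroutine call from elsewhere could jump to the BRA directly.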
A subpattern with an indefinite maximum repetition is replicated in the
compiled data its minimum number of times (or once with OP_BRAZERO if the
@ -411,4 +415,4 @@ at compile time, and so does not cause anything to be put into the compiled
data.
Philip Hazel
August 2007
April 2008


@ -1,6 +1,14 @@
News about PCRE releases
------------------------
Release 7.7 07-May-08
---------------------
This is once again mainly a bug-fix release, but there are a couple of new
features.
Release 7.6 28-Jan-08
---------------------


@ -276,6 +276,15 @@ library. You can read more about them in the pcrebuild man page.
Note that libreadline is GPL-licenced, so if you distribute a binary of
pcretest linked in this way, there may be licensing issues.
Setting this option causes the -lreadline option to be added to the pcretest
build. In many operating environments with a system-installed readline
library this is sufficient. However, in some environments (e.g. if an
unmodified distribution version of readline is in use), it may be necessary
to specify something like LIBS="-lncurses" as well. This is because, to quote
the readline INSTALL, "Readline uses the termcap functions, but does not link
with the termcap or curses library itself, allowing applications which link
with readline the to choose an appropriate library."
The "configure" script builds the following files for the basic C library:
. Makefile is the makefile that builds the library
@ -740,4 +749,4 @@ The distribution should contain the following files:
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
Last updated: 25 January 2008
Last updated: 13 April 2008


@ -132,9 +132,7 @@ them both to 0; an emulation function will be used. */
#endif
/* Define to 1 if you have the `strtoll' function. */
#ifndef HAVE_STRTOLL
#define HAVE_STRTOLL 1
#endif
/* #undef HAVE_STRTOLL */
/* Define to 1 if you have the `strtoq' function. */
#ifndef HAVE_STRTOQ
@ -251,13 +249,13 @@ them both to 0; an emulation function will be used. */
#define PACKAGE_NAME "PCRE"
/* Define to the full name and version of this package. */
#define PACKAGE_STRING "PCRE 7.6"
#define PACKAGE_STRING "PCRE 7.7"
/* Define to the one symbol short name of this package. */
#define PACKAGE_TARNAME "pcre"
/* Define to the version of this package. */
#define PACKAGE_VERSION "7.6"
#define PACKAGE_VERSION "7.7"
/* If you are compiling for a system other than a Unix-like system or
@ -310,7 +308,7 @@ them both to 0; an emulation function will be used. */
/* Version number of package */
#ifndef VERSION
#define VERSION "7.6"
#define VERSION "7.7"
#endif
/* Define to empty if `const' does not conform to ANSI C. */

File diff suppressed because it is too large


@ -42,9 +42,9 @@ POSSIBILITY OF SUCH DAMAGE.
/* The current PCRE version information. */
#define PCRE_MAJOR 7
#define PCRE_MINOR 6
#define PCRE_MINOR 7
#define PCRE_PRERELEASE
#define PCRE_DATE 2008-01-28
#define PCRE_DATE 2008-05-07
/* When an application links to a PCRE DLL in Windows, the symbols that are
imported have to be identified as such. When building PCRE, the appropriate
@ -124,6 +124,7 @@ extern "C" {
#define PCRE_NEWLINE_ANYCRLF 0x00500000
#define PCRE_BSR_ANYCRLF 0x00800000
#define PCRE_BSR_UNICODE 0x01000000
#define PCRE_JAVASCRIPT_COMPAT 0x02000000
/* Exec-time and get/set-time error codes */


@ -156,7 +156,7 @@ static const char verbnames[] =
"SKIP\0"
"THEN";
static verbitem verbs[] = {
static const verbitem verbs[] = {
{ 6, OP_ACCEPT },
{ 6, OP_COMMIT },
{ 1, OP_FAIL },
@ -166,7 +166,7 @@ static verbitem verbs[] = {
{ 4, OP_THEN }
};
static int verbcount = sizeof(verbs)/sizeof(verbitem);
static const int verbcount = sizeof(verbs)/sizeof(verbitem);
/* Tables of names of POSIX character classes and their lengths. The names are
@ -293,14 +293,15 @@ static const char error_texts[] =
/* 55 */
"repeating a DEFINE group is not allowed\0"
"inconsistent NEWLINE options\0"
"\\g is not followed by a braced name or an optionally braced non-zero number\0"
"(?+ or (?- or (?(+ or (?(- must be followed by a non-zero number\0"
"\\g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number\0"
"a numbered reference must not be zero\0"
"(*VERB) with an argument is not supported\0"
/* 60 */
"(*VERB) not recognized\0"
"number is too big\0"
"subpattern name expected\0"
"digit expected after (?+";
"digit expected after (?+\0"
"] is an invalid data character in JavaScript compatibility mode";
/* Table to identify digits and hex digits. This is used when compiling
@ -529,14 +530,31 @@ else
*errorcodeptr = ERR37;
break;
/* \g must be followed by a number, either plain or braced. If positive, it
is an absolute backreference. If negative, it is a relative backreference.
This is a Perl 5.10 feature. Perl 5.10 also supports \g{name} as a
reference to a named group. This is part of Perl's movement towards a
unified syntax for back references. As this is synonymous with \k{name}, we
fudge it up by pretending it really was \k. */
/* \g must be followed by one of a number of specific things:
(1) A number, either plain or braced. If positive, it is an absolute
backreference. If negative, it is a relative backreference. This is a Perl
5.10 feature.
(2) Perl 5.10 also supports \g{name} as a reference to a named group. This
is part of Perl's movement towards a unified syntax for back references. As
this is synonymous with \k{name}, we fudge it up by pretending it really
was \k.
(3) For Oniguruma compatibility we also support \g followed by a name or a
number either in angle brackets or in single quotes. However, these are
(possibly recursive) subroutine calls, _not_ backreferences. Just return
the -ESC_g code (cf \k). */
case 'g':
if (ptr[1] == '<' || ptr[1] == '\'')
{
c = -ESC_g;
break;
}
/* Handle the Perl-compatible cases */
if (ptr[1] == '{')
{
const uschar *p;
@ -563,18 +581,24 @@ else
while ((digitab[ptr[1]] & ctype_digit) != 0)
c = c * 10 + *(++ptr) - '0';
if (c < 0)
if (c < 0) /* Integer overflow */
{
*errorcodeptr = ERR61;
break;
}
if (c == 0 || (braced && *(++ptr) != '}'))
if (braced && *(++ptr) != '}')
{
*errorcodeptr = ERR57;
break;
}
if (c == 0)
{
*errorcodeptr = ERR58;
break;
}
if (negated)
{
if (c > bracount)
@ -609,7 +633,7 @@ else
c -= '0';
while ((digitab[ptr[1]] & ctype_digit) != 0)
c = c * 10 + *(++ptr) - '0';
if (c < 0)
if (c < 0) /* Integer overflow */
{
*errorcodeptr = ERR61;
break;
@ -950,7 +974,7 @@ be terminated by '>' because that is checked in the first pass.
Arguments:
ptr current position in the pattern
count current count of capturing parens so far encountered
cd compile background data
name name to seek, or NULL if seeking a numbered subpattern
lorn name length, or subpattern number if name is NULL
xmode TRUE if we are in /x mode
@ -959,10 +983,11 @@ Returns: the number of the named subpattern, or -1 if not found
*/
static int
find_parens(const uschar *ptr, int count, const uschar *name, int lorn,
find_parens(const uschar *ptr, compile_data *cd, const uschar *name, int lorn,
BOOL xmode)
{
const uschar *thisname;
int count = cd->bracount;
for (; *ptr != 0; ptr++)
{
@ -982,10 +1007,34 @@ for (; *ptr != 0; ptr++)
continue;
}
/* Skip over character classes */
/* Skip over character classes; this logic must be similar to the way they
are handled for real. If the first character is '^', skip it. Also, if the
first few characters (either before or after ^) are \Q\E or \E we skip them
too. This makes for compatibility with Perl. */
if (*ptr == '[')
{
BOOL negate_class = FALSE;
for (;;)
{
int c = *(++ptr);
if (c == '\\')
{
if (ptr[1] == 'E') ptr++;
else if (strncmp((const char *)ptr+1, "Q\\E", 3) == 0) ptr += 3;
else break;
}
else if (!negate_class && c == '^')
negate_class = TRUE;
else break;
}
/* If the next character is ']', it is a data character that must be
skipped, except in JavaScript compatibility mode. */
if (ptr[1] == ']' && (cd->external_options & PCRE_JAVASCRIPT_COMPAT) == 0)
ptr++;
while (*(++ptr) != ']')
{
if (*ptr == 0) return -1;
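
The class-skipping rule in the comments above can be sketched in isolation. This is not the find_parens() code from the hunk: skip_class_sketch is a hypothetical helper that peeks at the next character instead of pre-incrementing, takes a plain js_compat flag instead of compile_data, and ignores \Q...\E sequences inside the class body. It only illustrates that a ']' directly after '[' (or after '^') is data in Perl mode but closes an empty class in JavaScript compatibility mode.

```c
#include <string.h>

/* Hypothetical helper. pat is a NUL-terminated pattern and pat[open] is
   the '[' that starts a class. Returns the index of the ']' that closes
   the class, or -1 if the class is unterminated. */
long skip_class_sketch(const char *pat, long open, int js_compat)
{
  long p = open + 1;                  /* First character after '[' */
  int negate_seen = 0;

  /* Skip a leading '^' and any leading \E or \Q\E, as the real code does. */
  for (;;)
    {
    if (pat[p] == '\\' && pat[p + 1] == 'E') p += 2;
    else if (strncmp(pat + p, "\\Q\\E", 4) == 0) p += 4;
    else if (!negate_seen && pat[p] == '^') { negate_seen = 1; p++; }
    else break;
    }

  /* A ']' here is a data character, except in JavaScript mode. */
  if (pat[p] == ']' && !js_compat) p++;

  for (; pat[p] != ']'; p++)
    {
    if (pat[p] == '\0') return -1;
    if (pat[p] == '\\' && pat[p + 1] != '\0') p++;   /* Escaped character */
    }
  return p;
}
```

With the pattern from ChangeLog item 14, (?2)[]a()b](abc), the class opens at index 4. In Perl mode the sketch reports the closing bracket at index 10, so the () inside the class is not counted as a group; in JavaScript mode it reports index 5, the empty class.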
@ -1250,6 +1299,7 @@ for (;;)
case OP_NOT_WORDCHAR:
case OP_WORDCHAR:
case OP_ANY:
case OP_ALLANY:
branchlength++;
cc++;
break;
@ -1542,7 +1592,7 @@ for (code = first_significant_code(code + _pcre_OP_lengths[*code], NULL, 0, TRUE
/* Groups with zero repeats can of course be empty; skip them. */
if (c == OP_BRAZERO || c == OP_BRAMINZERO)
if (c == OP_BRAZERO || c == OP_BRAMINZERO || c == OP_SKIPZERO)
{
code += _pcre_OP_lengths[c];
do code += GET(code, 1); while (*code == OP_ALT);
@ -1628,6 +1678,7 @@ for (code = first_significant_code(code + _pcre_OP_lengths[*code], NULL, 0, TRUE
case OP_NOT_WORDCHAR:
case OP_WORDCHAR:
case OP_ANY:
case OP_ALLANY:
case OP_ANYBYTE:
case OP_CHAR:
case OP_CHARNC:
@ -1822,11 +1873,12 @@ return -1;
that is referenced. This means that groups can be replicated for fixed
repetition simply by copying (because the recursion is allowed to refer to
earlier groups that are outside the current group). However, when a group is
optional (i.e. the minimum quantifier is zero), OP_BRAZERO is inserted before
it, after it has been compiled. This means that any OP_RECURSE items within it
that refer to the group itself or any contained groups have to have their
offsets adjusted. That is one of the jobs of this function. Before it is called,
the partially compiled regex must be temporarily terminated with OP_END.
optional (i.e. the minimum quantifier is zero), OP_BRAZERO or OP_SKIPZERO is
inserted before it, after it has been compiled. This means that any OP_RECURSE
items within it that refer to the group itself or any contained groups have to
have their offsets adjusted. That is one of the jobs of this function. Before it
is called, the partially compiled regex must be temporarily terminated with
OP_END.
This function has been extended with the possibility of forward references for
recursions and subroutine calls. It must also check the list of such references
@ -2111,7 +2163,6 @@ if (next >= 0) switch(op_code)
/* For OP_NOT, "item" must be a single-byte character. */
case OP_NOT:
if (next < 0) return FALSE; /* Not a character */
if (item == next) return TRUE;
if ((options & PCRE_CASELESS) == 0) return FALSE;
#ifdef SUPPORT_UTF8
@ -2614,7 +2665,7 @@ for (;; ptr++)
zerofirstbyte = firstbyte;
zeroreqbyte = reqbyte;
previous = code;
*code++ = OP_ANY;
*code++ = ((options & PCRE_DOTALL) != 0)? OP_ALLANY: OP_ANY;
break;
@ -2629,7 +2680,17 @@ for (;; ptr++)
opcode is compiled. It may optionally have a bit map for characters < 256,
but those above are explicitly listed afterwards. A flag byte tells
whether the bitmap is present, and whether this is a negated class or not.
*/
In JavaScript compatibility mode, an isolated ']' causes an error. In
default (Perl) mode, it is treated as a data character. */
case ']':
if ((cd->external_options & PCRE_JAVASCRIPT_COMPAT) != 0)
{
*errorcodeptr = ERR64;
goto FAILED;
}
goto NORMAL_CHAR;
case '[':
previous = code;
@ -2663,6 +2724,19 @@ for (;; ptr++)
else break;
}
/* Empty classes are allowed in JavaScript compatibility mode. Otherwise,
an initial ']' is taken as a data character -- the code below handles
that. In JS mode, [] must always fail, so generate OP_FAIL, whereas
[^] must match any character, so generate OP_ALLANY. */
if (c == ']' && (cd->external_options & PCRE_JAVASCRIPT_COMPAT) != 0)
{
*code++ = negate_class? OP_ALLANY : OP_FAIL;
if (firstbyte == REQ_UNSET) firstbyte = REQ_NONE;
zerofirstbyte = firstbyte;
break;
}
/* If a class contains a negative special such as \S, we need to flip the
negation flag at the end, so that support for characters > 255 works
correctly (they are all included in the class). */
@ -3818,28 +3892,38 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
if (repeat_min == 0)
{
/* If the maximum is also zero, we just omit the group from the output
altogether. */
/* If the maximum is also zero, we used to just omit the group from the
output altogether, like this:
if (repeat_max == 0)
{
code = previous;
goto END_REPEAT;
}
** if (repeat_max == 0)
** {
** code = previous;
** goto END_REPEAT;
** }
/* If the maximum is 1 or unlimited, we just have to stick in the
BRAZERO and do no more at this point. However, we do need to adjust
any OP_RECURSE calls inside the group that refer to the group itself or
any internal or forward referenced group, because the offset is from
the start of the whole regex. Temporarily terminate the pattern while
doing this. */
However, that fails when a group is referenced as a subroutine from
elsewhere in the pattern, so now we stick in OP_SKIPZERO in front of it
so that it is skipped on execution. As we don't have a list of which
groups are referenced, we cannot do this selectively.
if (repeat_max <= 1)
If the maximum is 1 or unlimited, we just have to stick in the BRAZERO
and do no more at this point. However, we do need to adjust any
OP_RECURSE calls inside the group that refer to the group itself or any
internal or forward referenced group, because the offset is from the
start of the whole regex. Temporarily terminate the pattern while doing
this. */
if (repeat_max <= 1) /* Covers 0, 1, and unlimited */
{
*code = OP_END;
adjust_recurse(previous, 1, utf8, cd, save_hwm);
memmove(previous+1, previous, len);
code++;
if (repeat_max == 0)
{
*previous++ = OP_SKIPZERO;
goto END_REPEAT;
}
*previous++ = OP_BRAZERO + repeat_type;
}
@ -4034,6 +4118,13 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
}
}
/* If previous is OP_FAIL, it was generated by an empty class [] in
JavaScript mode. The other ways in which OP_FAIL can be generated, that is
by (*FAIL) or (?!) set previous to NULL, which gives a "nothing to repeat"
error above. We can just ignore the repeat in the JS case. */
else if (*previous == OP_FAIL) goto END_REPEAT;
/* Else there's some kind of shambles */
else
@ -4320,7 +4411,7 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
/* Search the pattern for a forward reference */
else if ((i = find_parens(ptr, cd->bracount, name, namelen,
else if ((i = find_parens(ptr, cd, name, namelen,
(options & PCRE_EXTENDED) != 0)) > 0)
{
PUT2(code, 2+LINK_SIZE, i);
@ -4566,7 +4657,7 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
references (?P=name) and recursion (?P>name), as well as falling
through from the Perl recursion syntax (?&name). We also come here from
the Perl \k<name> or \k'name' back reference syntax and the \k{name}
.NET syntax. */
.NET syntax, and the Oniguruma \g<...> and \g'...' subroutine syntax. */
NAMED_REF_OR_RECURSE:
name = ++ptr;
@ -4617,7 +4708,7 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
recno = GET2(slot, 0);
}
else if ((recno = /* Forward back reference */
find_parens(ptr, cd->bracount, name, namelen,
find_parens(ptr, cd, name, namelen,
(options & PCRE_EXTENDED) != 0)) <= 0)
{
*errorcodeptr = ERR15;
@ -4644,6 +4735,15 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
case '5': case '6': case '7': case '8': case '9': /* subroutine */
{
const uschar *called;
terminator = ')';
/* Come here from the \g<...> and \g'...' code (Oniguruma
compatibility). However, the syntax has been checked to ensure that
the ... are a (signed) number, so that neither ERR63 nor ERR29 will
be called on this path, nor will the jump to OTHER_CHAR_AFTER_QUERY
ever be taken. */
HANDLE_NUMERICAL_RECURSION:
if ((refsign = *ptr) == '+')
{
@ -4665,7 +4765,7 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
while((digitab[*ptr] & ctype_digit) != 0)
recno = recno * 10 + *ptr++ - '0';
if (*ptr != ')')
if (*ptr != terminator)
{
*errorcodeptr = ERR29;
goto FAILED;
@ -4718,8 +4818,8 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
if (called == NULL)
{
if (find_parens(ptr, cd->bracount, NULL, recno,
(options & PCRE_EXTENDED) != 0) < 0)
if (find_parens(ptr, cd, NULL, recno,
(options & PCRE_EXTENDED) != 0) < 0)
{
*errorcodeptr = ERR15;
goto FAILED;
@ -5089,6 +5189,64 @@ we set the flag only if there is a literal "\r" or "\n" in the class. */
zerofirstbyte = firstbyte;
zeroreqbyte = reqbyte;
/* \g<name> or \g'name' is a subroutine call by name and \g<n> or \g'n'
is a subroutine call by number (Oniguruma syntax). In fact, the value
-ESC_g is returned only for these cases. So we don't need to check for <
or ' if the value is -ESC_g. For the Perl syntax \g{n} the value is
-ESC_REF+n, and for the Perl syntax \g{name} the result is -ESC_k (as
that is a synonym for a named back reference). */
if (-c == ESC_g)
{
const uschar *p;
save_hwm = cd->hwm; /* Normally this is set when '(' is read */
terminator = (*(++ptr) == '<')? '>' : '\'';
/* These two statements stop the compiler from warning about possibly
unset variables caused by the jump to HANDLE_NUMERICAL_RECURSION. In
fact, because we actually check for a number below, the paths that
would actually be in error are never taken. */
skipbytes = 0;
reset_bracount = FALSE;
/* Test for a name */
if (ptr[1] != '+' && ptr[1] != '-')
{
BOOL isnumber = TRUE;
for (p = ptr + 1; *p != 0 && *p != terminator; p++)
{
if ((cd->ctypes[*p] & ctype_digit) == 0) isnumber = FALSE;
if ((cd->ctypes[*p] & ctype_word) == 0) break;
}
if (*p != terminator)
{
*errorcodeptr = ERR57;
break;
}
if (isnumber)
{
ptr++;
goto HANDLE_NUMERICAL_RECURSION;
}
is_recurse = TRUE;
goto NAMED_REF_OR_RECURSE;
}
/* Test a signed number in angle brackets or quotes. */
p = ptr + 2;
while ((digitab[*p] & ctype_digit) != 0) p++;
if (*p != terminator)
{
*errorcodeptr = ERR57;
break;
}
ptr++;
goto HANDLE_NUMERICAL_RECURSION;
}
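
The distinction between the Perl and Oniguruma forms of \g can be summarised in a standalone sketch. The names g_kind and classify_g_sketch are hypothetical and this is not how pcre_compile.c is structured; the sketch assumes ASCII word characters, does not reject a zero reference number, and only answers one question: does the text after \g denote a back reference (plain or braced) or a subroutine call (angle brackets or single quotes)?

```c
#include <ctype.h>

typedef enum { G_ERROR, G_BACKREF, G_SUBROUTINE } g_kind;

static int is_word(char c) { return isalnum((unsigned char)c) || c == '_'; }

/* Hypothetical classifier; s points just past the "\g" in a pattern. */
g_kind classify_g_sketch(const char *s)
{
  char term;
  const char *p;

  if (*s == '<') term = '>';
  else if (*s == '\'') term = '\'';
  else if (*s == '{') term = '}';
  else
    {
    /* Plain number, optionally negative: a Perl-style back reference. */
    if (*s == '-') s++;
    return isdigit((unsigned char)*s) ? G_BACKREF : G_ERROR;
    }

  p = s + 1;
  if (*p == '-' || (*p == '+' && term != '}'))
    {
    /* Signed (relative) number */
    p++;
    if (!isdigit((unsigned char)*p)) return G_ERROR;
    while (isdigit((unsigned char)*p)) p++;
    }
  else
    {
    /* Name or unsigned number: one or more word characters */
    if (!is_word(*p)) return G_ERROR;
    while (is_word(*p)) p++;
    }

  if (*p != term) return G_ERROR;     /* Missing or wrong terminator */
  return (term == '}') ? G_BACKREF : G_SUBROUTINE;
}
```

So \g{name}, \g{-1} and \g3 classify as back references, while \g<name>, \g'2' and \g<-1> classify as subroutine calls, which is the split the comments above describe.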
/* \k<name> or \k'name' is a back reference by name (Perl syntax).
We also support \k{name} (.NET syntax) */
@ -5595,14 +5753,14 @@ do {
if (!is_anchored(scode, options, bracket_map, backref_map)) return FALSE;
}
/* .* is not anchored unless DOTALL is set and it isn't in brackets that
are or may be referenced. */
/* .* is not anchored unless DOTALL is set (which generates OP_ALLANY) and
it isn't in brackets that are or may be referenced. */
else if ((op == OP_TYPESTAR || op == OP_TYPEMINSTAR ||
op == OP_TYPEPOSSTAR) &&
(*options & PCRE_DOTALL) != 0)
op == OP_TYPEPOSSTAR))
{
if (scode[1] != OP_ANY || (bracket_map & backref_map) != 0) return FALSE;
if (scode[1] != OP_ALLANY || (bracket_map & backref_map) != 0)
return FALSE;
}
/* Check for explicit anchoring */


@ -1146,11 +1146,11 @@ for (;;)
do ecode += GET(ecode,1); while (*ecode == OP_ALT);
break;
/* BRAZERO and BRAMINZERO occur just before a bracket group, indicating
that it may occur zero times. It may repeat infinitely, or not at all -
i.e. it could be ()* or ()? in the pattern. Brackets with fixed upper
repeat limits are compiled as a number of copies, with the optional ones
preceded by BRAZERO or BRAMINZERO. */
/* BRAZERO, BRAMINZERO and SKIPZERO occur just before a bracket group,
indicating that it may occur zero times. It may repeat infinitely, or not
at all - i.e. it could be ()* or ()? or even (){0} in the pattern. Brackets
with fixed upper repeat limits are compiled as a number of copies, with the
optional ones preceded by BRAZERO or BRAMINZERO. */
case OP_BRAZERO:
{
@ -1172,6 +1172,14 @@ for (;;)
}
break;
case OP_SKIPZERO:
{
next = ecode+1;
do next += GET(next,1); while (*next == OP_ALT);
ecode = next + 1 + LINK_SIZE;
}
break;
/* End of a group, repeated or non-repeating. */
case OP_KET:
@ -1419,13 +1427,12 @@ for (;;)
/* Match a single character type; inline for speed */
case OP_ANY:
if ((ims & PCRE_DOTALL) == 0)
{
if (IS_NEWLINE(eptr)) RRETURN(MATCH_NOMATCH);
}
if (IS_NEWLINE(eptr)) RRETURN(MATCH_NOMATCH);
/* Fall through */
case OP_ALLANY:
if (eptr++ >= md->end_subject) RRETURN(MATCH_NOMATCH);
if (utf8)
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
if (utf8) while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
ecode++;
break;
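
The effect of splitting OP_ANY into two opcodes can be shown with a toy matcher. Everything here is a simplified assumption: the TOY_* opcodes and both functions are hypothetical, only '\n' counts as a newline, and there is no UTF-8 handling. The point is that the DOTALL decision is made once, when "." is compiled, and the match-time code no longer looks at any option flag.

```c
/* Toy opcodes for illustration only; these are not PCRE's values. */
enum { TOY_OP_ANY, TOY_OP_ALLANY };

/* Compile time: pick the opcode for "." from the DOTALL setting. */
int toy_compile_dot(int dotall)
{
  return dotall ? TOY_OP_ALLANY : TOY_OP_ANY;
}

/* Match time: returns the new subject position, or -1 for no match.
   Note that no option flag is consulted here. */
long toy_match_dot(int op, const char *subject, long pos, long len)
{
  switch (op)
    {
    case TOY_OP_ANY:
      if (pos < len && subject[pos] == '\n') return -1;
      /* Fall through */
    case TOY_OP_ALLANY:
      if (pos >= len) return -1;
      return pos + 1;
    }
  return -1;
}
```

Against the subject "a\nb" at position 1, the opcode compiled without DOTALL fails, while the one compiled with DOTALL advances to position 2.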
@ -1721,16 +1728,25 @@ for (;;)
case OP_REF:
{
offset = GET2(ecode, 1) << 1; /* Doubled ref number */
ecode += 3; /* Advance past item */
ecode += 3;
/* If the reference is unset, set the length to be longer than the amount
of subject left; this ensures that every attempt at a match fails. We
can't just fail here, because of the possibility of quantifiers with zero
minima. */
/* If the reference is unset, there are two possibilities:
length = (offset >= offset_top || md->offset_vector[offset] < 0)?
md->end_subject - eptr + 1 :
md->offset_vector[offset+1] - md->offset_vector[offset];
(a) In the default, Perl-compatible state, set the length to be longer
than the amount of subject left; this ensures that every attempt at a
match fails. We can't just fail here, because of the possibility of
quantifiers with zero minima.
(b) If the JavaScript compatibility flag is set, set the length to zero
so that the back reference matches an empty string.
Otherwise, set the length to the length of what was matched by the
referenced subpattern. */
if (offset >= offset_top || md->offset_vector[offset] < 0)
length = (md->jscript_compat)? 0 : md->end_subject - eptr + 1;
else
length = md->offset_vector[offset+1] - md->offset_vector[offset];
/* Set up for repetition, or handle the non-repeated case */
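
The length selection for an unset back reference can be pulled out into a small function for illustration. toy_ref_length and its parameters are hypothetical stand-ins for the match_data fields used above: ovector plays the role of md->offset_vector, and remaining stands for md->end_subject - eptr.

```c
/* Hypothetical helper. ovector holds start/end offset pairs, with -1 for
   groups that have not matched; offset is twice the group number. */
long toy_ref_length(const int *ovector, int offset, int offset_top,
                    long remaining, int js_compat)
{
  if (offset >= offset_top || ovector[offset] < 0)
    {
    /* Unset reference: an empty match in JavaScript mode, otherwise a
       length that is longer than the rest of the subject, so that any
       attempt to match it fails. */
    return js_compat ? 0 : remaining + 1;
    }
  return ovector[offset + 1] - ovector[offset];
}
```

With four subject characters left, an unset group gives length 5 in the default mode (which can never match) and length 0 in JavaScript compatibility mode (which always matches the empty string).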
@ -2933,14 +2949,22 @@ for (;;)
case OP_ANY:
for (i = 1; i <= min; i++)
{
if (eptr >= md->end_subject ||
((ims & PCRE_DOTALL) == 0 && IS_NEWLINE(eptr)))
if (eptr >= md->end_subject || IS_NEWLINE(eptr))
RRETURN(MATCH_NOMATCH);
eptr++;
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
}
break;
case OP_ALLANY:
for (i = 1; i <= min; i++)
{
if (eptr >= md->end_subject) RRETURN(MATCH_NOMATCH);
eptr++;
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
}
break;
case OP_ANYBYTE:
eptr += min;
break;
@ -3149,15 +3173,15 @@ for (;;)
switch(ctype)
{
case OP_ANY:
if ((ims & PCRE_DOTALL) == 0)
for (i = 1; i <= min; i++)
{
for (i = 1; i <= min; i++)
{
if (IS_NEWLINE(eptr)) RRETURN(MATCH_NOMATCH);
eptr++;
}
if (IS_NEWLINE(eptr)) RRETURN(MATCH_NOMATCH);
eptr++;
}
else eptr += min;
break;
case OP_ALLANY:
eptr += min;
break;
case OP_ANYBYTE:
@ -3414,16 +3438,14 @@ for (;;)
RMATCH(eptr, ecode, offset_top, md, ims, eptrb, 0, RM42);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
if (fi >= max || eptr >= md->end_subject ||
(ctype == OP_ANY && (ims & PCRE_DOTALL) == 0 &&
IS_NEWLINE(eptr)))
(ctype == OP_ANY && IS_NEWLINE(eptr)))
RRETURN(MATCH_NOMATCH);
GETCHARINC(c, eptr);
switch(ctype)
{
case OP_ANY: /* This is the DOTALL case */
break;
case OP_ANY: /* This is the non-NL case */
case OP_ALLANY:
case OP_ANYBYTE:
break;
@ -3575,15 +3597,14 @@ for (;;)
RMATCH(eptr, ecode, offset_top, md, ims, eptrb, 0, RM43);
if (rrc != MATCH_NOMATCH) RRETURN(rrc);
if (fi >= max || eptr >= md->end_subject ||
((ims & PCRE_DOTALL) == 0 && IS_NEWLINE(eptr)))
(ctype == OP_ANY && IS_NEWLINE(eptr)))
RRETURN(MATCH_NOMATCH);
c = *eptr++;
switch(ctype)
{
case OP_ANY: /* This is the DOTALL case */
break;
case OP_ANY: /* This is the non-NL case */
case OP_ALLANY:
case OP_ANYBYTE:
break;
@ -3837,23 +3858,11 @@ for (;;)
case OP_ANY:
if (max < INT_MAX)
{
if ((ims & PCRE_DOTALL) == 0)
for (i = min; i < max; i++)
{
for (i = min; i < max; i++)
{
if (eptr >= md->end_subject || IS_NEWLINE(eptr)) break;
eptr++;
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
}
}
else
{
for (i = min; i < max; i++)
{
if (eptr >= md->end_subject) break;
eptr++;
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
}
if (eptr >= md->end_subject || IS_NEWLINE(eptr)) break;
eptr++;
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
}
}
@ -3861,22 +3870,28 @@ for (;;)
else
{
if ((ims & PCRE_DOTALL) == 0)
for (i = min; i < max; i++)
{
for (i = min; i < max; i++)
{
if (eptr >= md->end_subject || IS_NEWLINE(eptr)) break;
eptr++;
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
}
}
else
{
eptr = md->end_subject;
if (eptr >= md->end_subject || IS_NEWLINE(eptr)) break;
eptr++;
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
}
}
break;
case OP_ALLANY:
if (max < INT_MAX)
{
for (i = min; i < max; i++)
{
if (eptr >= md->end_subject) break;
eptr++;
while (eptr < md->end_subject && (*eptr & 0xc0) == 0x80) eptr++;
}
}
else eptr = md->end_subject; /* Unlimited UTF-8 repeat */
break;
/* The byte case is the same as non-UTF8 */
case OP_ANYBYTE:
@ -4062,17 +4077,14 @@ for (;;)
switch(ctype)
{
case OP_ANY:
if ((ims & PCRE_DOTALL) == 0)
for (i = min; i < max; i++)
{
for (i = min; i < max; i++)
{
if (eptr >= md->end_subject || IS_NEWLINE(eptr)) break;
eptr++;
}
break;
if (eptr >= md->end_subject || IS_NEWLINE(eptr)) break;
eptr++;
}
/* For DOTALL case, fall through and treat as \C */
break;
case OP_ALLANY:
case OP_ANYBYTE:
c = max - min;
if (c > (unsigned int)(md->end_subject - eptr))
@ -4448,6 +4460,7 @@ end_subject = md->end_subject;
md->endonly = (re->options & PCRE_DOLLAR_ENDONLY) != 0;
utf8 = md->utf8 = (re->options & PCRE_UTF8) != 0;
md->jscript_compat = (re->options & PCRE_JAVASCRIPT_COMPAT) != 0;
md->notbol = (options & PCRE_NOTBOL) != 0;
md->noteol = (options & PCRE_NOTEOL) != 0;


@ -514,7 +514,8 @@ time, run time, or study time, respectively. */
(PCRE_CASELESS|PCRE_EXTENDED|PCRE_ANCHORED|PCRE_MULTILINE| \
PCRE_DOTALL|PCRE_DOLLAR_ENDONLY|PCRE_EXTRA|PCRE_UNGREEDY|PCRE_UTF8| \
PCRE_NO_AUTO_CAPTURE|PCRE_NO_UTF8_CHECK|PCRE_AUTO_CALLOUT|PCRE_FIRSTLINE| \
PCRE_DUPNAMES|PCRE_NEWLINE_BITS|PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE)
PCRE_DUPNAMES|PCRE_NEWLINE_BITS|PCRE_BSR_ANYCRLF|PCRE_BSR_UNICODE| \
PCRE_JAVASCRIPT_COMPAT)
#define PUBLIC_EXEC_OPTIONS \
(PCRE_ANCHORED|PCRE_NOTBOL|PCRE_NOTEOL|PCRE_NOTEMPTY|PCRE_NO_UTF8_CHECK| \
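
A public-options mask of this kind is typically used to reject option words that contain bits the library does not know about, which is why a new flag such as PCRE_JAVASCRIPT_COMPAT has to be added to it. The sketch below shows that check in isolation. The flag names and values are hypothetical and are not PCRE's real constants.

```c
/* Hypothetical flag values for illustration; not PCRE's real constants. */
#define OPT_CASELESS   0x0001u
#define OPT_DOTALL     0x0004u
#define OPT_JS_COMPAT  0x0100u

#define PUBLIC_OPTS (OPT_CASELESS | OPT_DOTALL | OPT_JS_COMPAT)

/* Returns 1 when every bit set in options belongs to the public mask. */
static int options_are_public(unsigned int options)
{
  return (options & ~PUBLIC_OPTS) == 0;
}
```

If `OPT_JS_COMPAT` were left out of `PUBLIC_OPTS`, any caller passing it would fail this test even though the flag is otherwise implemented.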
@ -604,16 +605,20 @@ contain UTF-8 characters with values greater than 255. */
value such as \n. They must have non-zero values, as check_escape() returns
their negation. Also, they must appear in the same order as in the opcode
definitions below, up to ESC_z. There's a dummy for OP_ANY because it
corresponds to "." rather than an escape sequence. The final one must be
ESC_REF as subsequent values are used for backreferences (\1, \2, \3, etc).
There are two tests in the code for an escape greater than ESC_b and less than
ESC_Z to detect the types that may be repeated. These are the types that
consume characters. If any new escapes are put in between that don't consume a
character, that code will have to change. */
corresponds to "." rather than an escape sequence, and another for OP_ALLANY
(which is used for [^] in JavaScript compatibility mode).
The final escape must be ESC_REF as subsequent values are used for
backreferences (\1, \2, \3, etc). There are two tests in the code for an escape
greater than ESC_b and less than ESC_Z to detect the types that may be
repeated. These are the types that consume characters. If any new escapes are
put in between that don't consume a character, that code will have to change.
*/
enum { ESC_A = 1, ESC_G, ESC_K, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s,
ESC_W, ESC_w, ESC_dum1, ESC_C, ESC_P, ESC_p, ESC_R, ESC_H, ESC_h,
ESC_V, ESC_v, ESC_X, ESC_Z, ESC_z, ESC_E, ESC_Q, ESC_k, ESC_REF };
ESC_W, ESC_w, ESC_dum1, ESC_dum2, ESC_C, ESC_P, ESC_p, ESC_R, ESC_H,
ESC_h, ESC_V, ESC_v, ESC_X, ESC_Z, ESC_z, ESC_E, ESC_Q, ESC_g, ESC_k,
ESC_REF };
/* Opcode table: Starting from 1 (i.e. after OP_END), the values up to
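
The comment above states two conventions: escapes are reported as the negation of their ESC_ value, and the repeatable (character-consuming) types are those strictly between ESC_b and ESC_Z. A small sketch of that range test, using the enum as it stands after this change; `is_repeatable_escape` is a hypothetical helper, not a PCRE function.

```c
/* The enum as it stands after this change. */
enum { ESC_A = 1, ESC_G, ESC_K, ESC_B, ESC_b, ESC_D, ESC_d, ESC_S, ESC_s,
       ESC_W, ESC_w, ESC_dum1, ESC_dum2, ESC_C, ESC_P, ESC_p, ESC_R, ESC_H,
       ESC_h, ESC_V, ESC_v, ESC_X, ESC_Z, ESC_z, ESC_E, ESC_Q, ESC_g, ESC_k,
       ESC_REF };

/* Hypothetical helper: c is a check_escape()-style result, where a
   negative value is the negation of an ESC_ code and a non-negative
   value is a literal data character. */
static int is_repeatable_escape(int c)
{
  return c < 0 && -c > ESC_b && -c < ESC_Z;
}
```

The second dummy, ESC_dum2, keeps ESC_C at the same numeric value (14) as OP_ANYBYTE in the renumbered opcode table, in line with the requirement that the escapes appear in the same order as the opcodes.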
@ -639,141 +644,146 @@ enum {
OP_WHITESPACE, /* 9 \s */
OP_NOT_WORDCHAR, /* 10 \W */
OP_WORDCHAR, /* 11 \w */
OP_ANY, /* 12 Match any character */
OP_ANYBYTE, /* 13 Match any byte (\C); different to OP_ANY for UTF-8 */
OP_NOTPROP, /* 14 \P (not Unicode property) */
OP_PROP, /* 15 \p (Unicode property) */
OP_ANYNL, /* 16 \R (any newline sequence) */
OP_NOT_HSPACE, /* 17 \H (not horizontal whitespace) */
OP_HSPACE, /* 18 \h (horizontal whitespace) */
OP_NOT_VSPACE, /* 19 \V (not vertical whitespace) */
OP_VSPACE, /* 20 \v (vertical whitespace) */
OP_EXTUNI, /* 21 \X (extended Unicode sequence) */
OP_EODN, /* 22 End of data or \n at end of data: \Z. */
OP_EOD, /* 23 End of data: \z */
OP_ANY, /* 12 Match any character (subject to DOTALL) */
OP_ALLANY, /* 13 Match any character (not subject to DOTALL) */
OP_ANYBYTE, /* 14 Match any byte (\C); different to OP_ANY for UTF-8 */
OP_NOTPROP, /* 15 \P (not Unicode property) */
OP_PROP, /* 16 \p (Unicode property) */
OP_ANYNL, /* 17 \R (any newline sequence) */
OP_NOT_HSPACE, /* 18 \H (not horizontal whitespace) */
OP_HSPACE, /* 19 \h (horizontal whitespace) */
OP_NOT_VSPACE, /* 20 \V (not vertical whitespace) */
OP_VSPACE, /* 21 \v (vertical whitespace) */
OP_EXTUNI, /* 22 \X (extended Unicode sequence) */
OP_EODN, /* 23 End of data or \n at end of data: \Z. */
OP_EOD, /* 24 End of data: \z */
OP_OPT, /* 24 Set runtime options */
OP_CIRC, /* 25 Start of line - varies with multiline switch */
OP_DOLL, /* 26 End of line - varies with multiline switch */
OP_CHAR, /* 27 Match one character, casefully */
OP_CHARNC, /* 28 Match one character, caselessly */
OP_NOT, /* 29 Match one character, not the following one */
OP_OPT, /* 25 Set runtime options */
OP_CIRC, /* 26 Start of line - varies with multiline switch */
OP_DOLL, /* 27 End of line - varies with multiline switch */
OP_CHAR, /* 28 Match one character, casefully */
OP_CHARNC, /* 29 Match one character, caselessly */
OP_NOT, /* 30 Match one character, not the following one */
OP_STAR, /* 30 The maximizing and minimizing versions of */
OP_MINSTAR, /* 31 these six opcodes must come in pairs, with */
OP_PLUS, /* 32 the minimizing one second. */
OP_MINPLUS, /* 33 This first set applies to single characters.*/
OP_QUERY, /* 34 */
OP_MINQUERY, /* 35 */
OP_STAR, /* 31 The maximizing and minimizing versions of */
OP_MINSTAR, /* 32 these six opcodes must come in pairs, with */
OP_PLUS, /* 33 the minimizing one second. */
OP_MINPLUS, /* 34 This first set applies to single characters.*/
OP_QUERY, /* 35 */
OP_MINQUERY, /* 36 */
OP_UPTO, /* 36 From 0 to n matches */
OP_MINUPTO, /* 37 */
OP_EXACT, /* 38 Exactly n matches */
OP_UPTO, /* 37 From 0 to n matches */
OP_MINUPTO, /* 38 */
OP_EXACT, /* 39 Exactly n matches */
OP_POSSTAR, /* 39 Possessified star */
OP_POSPLUS, /* 40 Possessified plus */
OP_POSQUERY, /* 41 Possessified query */
OP_POSUPTO, /* 42 Possessified upto */
OP_POSSTAR, /* 40 Possessified star */
OP_POSPLUS, /* 41 Possessified plus */
OP_POSQUERY, /* 42 Possessified query */
OP_POSUPTO, /* 43 Possessified upto */
OP_NOTSTAR, /* 43 The maximizing and minimizing versions of */
OP_NOTMINSTAR, /* 44 these six opcodes must come in pairs, with */
OP_NOTPLUS, /* 45 the minimizing one second. They must be in */
OP_NOTMINPLUS, /* 46 exactly the same order as those above. */
OP_NOTQUERY, /* 47 This set applies to "not" single characters. */
OP_NOTMINQUERY, /* 48 */
OP_NOTSTAR, /* 44 The maximizing and minimizing versions of */
OP_NOTMINSTAR, /* 45 these six opcodes must come in pairs, with */
OP_NOTPLUS, /* 46 the minimizing one second. They must be in */
OP_NOTMINPLUS, /* 47 exactly the same order as those above. */
OP_NOTQUERY, /* 48 This set applies to "not" single characters. */
OP_NOTMINQUERY, /* 49 */
OP_NOTUPTO, /* 49 From 0 to n matches */
OP_NOTMINUPTO, /* 50 */
OP_NOTEXACT, /* 51 Exactly n matches */
OP_NOTUPTO, /* 50 From 0 to n matches */
OP_NOTMINUPTO, /* 51 */
OP_NOTEXACT, /* 52 Exactly n matches */
OP_NOTPOSSTAR, /* 52 Possessified versions */
OP_NOTPOSPLUS, /* 53 */
OP_NOTPOSQUERY, /* 54 */
OP_NOTPOSUPTO, /* 55 */
OP_NOTPOSSTAR, /* 53 Possessified versions */
OP_NOTPOSPLUS, /* 54 */
OP_NOTPOSQUERY, /* 55 */
OP_NOTPOSUPTO, /* 56 */
OP_TYPESTAR, /* 56 The maximizing and minimizing versions of */
OP_TYPEMINSTAR, /* 57 these six opcodes must come in pairs, with */
OP_TYPEPLUS, /* 58 the minimizing one second. These codes must */
OP_TYPEMINPLUS, /* 59 be in exactly the same order as those above. */
OP_TYPEQUERY, /* 60 This set applies to character types such as \d */
OP_TYPEMINQUERY, /* 61 */
OP_TYPESTAR, /* 57 The maximizing and minimizing versions of */
OP_TYPEMINSTAR, /* 58 these six opcodes must come in pairs, with */
OP_TYPEPLUS, /* 59 the minimizing one second. These codes must */
OP_TYPEMINPLUS, /* 60 be in exactly the same order as those above. */
OP_TYPEQUERY, /* 61 This set applies to character types such as \d */
OP_TYPEMINQUERY, /* 62 */
OP_TYPEUPTO, /* 62 From 0 to n matches */
OP_TYPEMINUPTO, /* 63 */
OP_TYPEEXACT, /* 64 Exactly n matches */
OP_TYPEUPTO, /* 63 From 0 to n matches */
OP_TYPEMINUPTO, /* 64 */
OP_TYPEEXACT, /* 65 Exactly n matches */
OP_TYPEPOSSTAR, /* 65 Possessified versions */
OP_TYPEPOSPLUS, /* 66 */
OP_TYPEPOSQUERY, /* 67 */
OP_TYPEPOSUPTO, /* 68 */
OP_TYPEPOSSTAR, /* 66 Possessified versions */
OP_TYPEPOSPLUS, /* 67 */
OP_TYPEPOSQUERY, /* 68 */
OP_TYPEPOSUPTO, /* 69 */
OP_CRSTAR, /* 69 The maximizing and minimizing versions of */
OP_CRMINSTAR, /* 70 all these opcodes must come in pairs, with */
OP_CRPLUS, /* 71 the minimizing one second. These codes must */
OP_CRMINPLUS, /* 72 be in exactly the same order as those above. */
OP_CRQUERY, /* 73 These are for character classes and back refs */
OP_CRMINQUERY, /* 74 */
OP_CRRANGE, /* 75 These are different to the three sets above. */
OP_CRMINRANGE, /* 76 */
OP_CRSTAR, /* 70 The maximizing and minimizing versions of */
OP_CRMINSTAR, /* 71 all these opcodes must come in pairs, with */
OP_CRPLUS, /* 72 the minimizing one second. These codes must */
OP_CRMINPLUS, /* 73 be in exactly the same order as those above. */
OP_CRQUERY, /* 74 These are for character classes and back refs */
OP_CRMINQUERY, /* 75 */
OP_CRRANGE, /* 76 These are different to the three sets above. */
OP_CRMINRANGE, /* 77 */
OP_CLASS, /* 77 Match a character class, chars < 256 only */
OP_NCLASS, /* 78 Same, but the bitmap was created from a negative
OP_CLASS, /* 78 Match a character class, chars < 256 only */
OP_NCLASS, /* 79 Same, but the bitmap was created from a negative
class - the difference is relevant only when a UTF-8
character > 255 is encountered. */
OP_XCLASS, /* 79 Extended class for handling UTF-8 chars within the
OP_XCLASS, /* 80 Extended class for handling UTF-8 chars within the
class. This does both positive and negative. */
OP_REF, /* 80 Match a back reference */
OP_RECURSE, /* 81 Match a numbered subpattern (possibly recursive) */
OP_CALLOUT, /* 82 Call out to external function if provided */
OP_REF, /* 81 Match a back reference */
OP_RECURSE, /* 82 Match a numbered subpattern (possibly recursive) */
OP_CALLOUT, /* 83 Call out to external function if provided */
OP_ALT, /* 83 Start of alternation */
OP_KET, /* 84 End of group that doesn't have an unbounded repeat */
OP_KETRMAX, /* 85 These two must remain together and in this */
OP_KETRMIN, /* 86 order. They are for groups that repeat forever. */
OP_ALT, /* 84 Start of alternation */
OP_KET, /* 85 End of group that doesn't have an unbounded repeat */
OP_KETRMAX, /* 86 These two must remain together and in this */
OP_KETRMIN, /* 87 order. They are for groups that repeat forever. */
/* The assertions must come before BRA, CBRA, ONCE, and COND.*/
OP_ASSERT, /* 87 Positive lookahead */
OP_ASSERT_NOT, /* 88 Negative lookahead */
OP_ASSERTBACK, /* 89 Positive lookbehind */
OP_ASSERTBACK_NOT, /* 90 Negative lookbehind */
OP_REVERSE, /* 91 Move pointer back - used in lookbehind assertions */
OP_ASSERT, /* 88 Positive lookahead */
OP_ASSERT_NOT, /* 89 Negative lookahead */
OP_ASSERTBACK, /* 90 Positive lookbehind */
OP_ASSERTBACK_NOT, /* 91 Negative lookbehind */
OP_REVERSE, /* 92 Move pointer back - used in lookbehind assertions */
/* ONCE, BRA, CBRA, and COND must come after the assertions, with ONCE first,
as there's a test for >= ONCE for a subpattern that isn't an assertion. */
OP_ONCE, /* 92 Atomic group */
OP_BRA, /* 93 Start of non-capturing bracket */
OP_CBRA, /* 94 Start of capturing bracket */
OP_COND, /* 95 Conditional group */
OP_ONCE, /* 93 Atomic group */
OP_BRA, /* 94 Start of non-capturing bracket */
OP_CBRA, /* 95 Start of capturing bracket */
OP_COND, /* 96 Conditional group */
/* These three must follow the previous three, in the same order. There's a
check for >= SBRA to distinguish the two sets. */
OP_SBRA, /* 96 Start of non-capturing bracket, check empty */
OP_SCBRA, /* 97 Start of capturing bracket, check empty */
OP_SCOND, /* 98 Conditional group, check empty */
OP_SBRA, /* 97 Start of non-capturing bracket, check empty */
OP_SCBRA, /* 98 Start of capturing bracket, check empty */
OP_SCOND, /* 99 Conditional group, check empty */
OP_CREF, /* 99 Used to hold a capture number as condition */
OP_RREF, /* 100 Used to hold a recursion number as condition */
OP_DEF, /* 101 The DEFINE condition */
OP_CREF, /* 100 Used to hold a capture number as condition */
OP_RREF, /* 101 Used to hold a recursion number as condition */
OP_DEF, /* 102 The DEFINE condition */
OP_BRAZERO, /* 102 These two must remain together and in this */
OP_BRAMINZERO, /* 103 order. */
OP_BRAZERO, /* 103 These two must remain together and in this */
OP_BRAMINZERO, /* 104 order. */
/* These are backtracking control verbs */
OP_PRUNE, /* 104 */
OP_SKIP, /* 105 */
OP_THEN, /* 106 */
OP_COMMIT, /* 107 */
OP_PRUNE, /* 105 */
OP_SKIP, /* 106 */
OP_THEN, /* 107 */
OP_COMMIT, /* 108 */
/* These are forced failure and success verbs */
OP_FAIL, /* 108 */
OP_ACCEPT /* 109 */
OP_FAIL, /* 109 */
OP_ACCEPT, /* 110 */
/* This is used to skip a subpattern with a {0} quantifier */
OP_SKIPZERO /* 111 */
};
@ -782,7 +792,7 @@ for debugging. The macro is referenced only in pcre_printint.c. */
#define OP_NAME_LIST \
"End", "\\A", "\\G", "\\K", "\\B", "\\b", "\\D", "\\d", \
"\\S", "\\s", "\\W", "\\w", "Any", "Anybyte", \
"\\S", "\\s", "\\W", "\\w", "Any", "AllAny", "Anybyte", \
"notprop", "prop", "\\R", "\\H", "\\h", "\\V", "\\v", \
"extuni", "\\Z", "\\z", \
"Opt", "^", "$", "char", "charnc", "not", \
@ -798,7 +808,8 @@ for debugging. The macro is referenced only in pcre_printint.c. */
"AssertB", "AssertB not", "Reverse", \
"Once", "Bra", "CBra", "Cond", "SBra", "SCBra", "SCond", \
"Cond ref", "Cond rec", "Cond def", "Brazero", "Braminzero", \
"*PRUNE", "*SKIP", "*THEN", "*COMMIT", "*FAIL", "*ACCEPT"
"*PRUNE", "*SKIP", "*THEN", "*COMMIT", "*FAIL", "*ACCEPT", \
"Skip zero"
/* This macro defines the length of fixed length operations in the compiled
@ -814,7 +825,7 @@ in UTF-8 mode. The code that uses this table must know about such things. */
1, /* End */ \
1, 1, 1, 1, 1, /* \A, \G, \K, \B, \b */ \
1, 1, 1, 1, 1, 1, /* \D, \d, \S, \s, \W, \w */ \
1, 1, /* Any, Anybyte */ \
1, 1, 1, /* Any, AllAny, Anybyte */ \
3, 3, 1, /* NOTPROP, PROP, EXTUNI */ \
1, 1, 1, 1, 1, /* \R, \H, \h, \V, \v */ \
1, 1, 2, 1, 1, /* \Z, \z, Opt, ^, $ */ \
@ -863,7 +874,7 @@ in UTF-8 mode. The code that uses this table must know about such things. */
1, /* DEF */ \
1, 1, /* BRAZERO, BRAMINZERO */ \
1, 1, 1, 1, /* PRUNE, SKIP, THEN, COMMIT, */ \
1, 1 /* FAIL, ACCEPT */
1, 1, 1 /* FAIL, ACCEPT, SKIPZERO */
/* A magic value for OP_RREF to indicate the "any recursion" condition. */
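
Adding OP_ALLANY and OP_SKIPZERO means three things must change together: the opcode enum, the name list, and the length table. One way to catch a table that has drifted is a compile-time size check. The sketch below uses a toy opcode set with hypothetical names; it is not how PCRE itself guards these tables.

```c
#include <stddef.h>

/* Toy opcode set; names are hypothetical, not PCRE's. */
enum { TOY_END, TOY_ANY, TOY_ALLANY, TOY_ANYBYTE, TOY_TABLE_SIZE };

static const char *const toy_names[] = { "End", "Any", "AllAny", "Anybyte" };
static const unsigned char toy_lengths[] = { 1, 1, 1, 1 };

#define COUNT_OF(a) (sizeof(a) / sizeof((a)[0]))

/* Compilation fails (negative array size) if a table drifts from the enum. */
typedef char toy_names_in_sync[
  COUNT_OF(toy_names) == (size_t)TOY_TABLE_SIZE ? 1 : -1];
typedef char toy_lengths_in_sync[
  COUNT_OF(toy_lengths) == (size_t)TOY_TABLE_SIZE ? 1 : -1];
```

Inserting a new enumerator before `TOY_TABLE_SIZE` without extending both tables makes the typedefs ill-formed, so the mismatch shows up at build time rather than as a wrong name or length at run time.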
@ -879,7 +890,7 @@ enum { ERR0, ERR1, ERR2, ERR3, ERR4, ERR5, ERR6, ERR7, ERR8, ERR9,
ERR30, ERR31, ERR32, ERR33, ERR34, ERR35, ERR36, ERR37, ERR38, ERR39,
ERR40, ERR41, ERR42, ERR43, ERR44, ERR45, ERR46, ERR47, ERR48, ERR49,
ERR50, ERR51, ERR52, ERR53, ERR54, ERR55, ERR56, ERR57, ERR58, ERR59,
ERR60, ERR61, ERR62, ERR63 };
ERR60, ERR61, ERR62, ERR63, ERR64 };
/* The real format of the start of the pcre block; the index of names and the
code vector run on as long as necessary after the end. We store an explicit
@ -1004,6 +1015,7 @@ typedef struct match_data {
BOOL notbol; /* NOTBOL flag */
BOOL noteol; /* NOTEOL flag */
BOOL utf8; /* UTF8 flag */
BOOL jscript_compat; /* JAVASCRIPT_COMPAT flag */
BOOL endonly; /* Dollar not before final \n */
BOOL notempty; /* Empty string match not wanted */
BOOL partial; /* PARTIAL flag */

View File

@ -215,6 +215,13 @@ do
tcode += 1 + LINK_SIZE;
break;
/* SKIPZERO skips the bracket. */
case OP_SKIPZERO:
do tcode += GET(tcode,1); while (*tcode == OP_ALT);
tcode += 1 + LINK_SIZE;
break;
/* Single-char * or ? sets the bit and tries the next item */
case OP_STAR:
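
The SKIPZERO case above skips a whole bracketed group by following link fields from the opening bracket through each alternative until it reaches the closing ket, then stepping over the ket itself. The sketch below reproduces that walk on a toy bytecode. The opcode values, `LINK_SIZE`, and `GET` here are simplified stand-ins chosen for illustration, and `skip_group` assumes its argument points at the opening bracket.

```c
/* Simplified stand-ins for illustration; not PCRE's definitions. */
#define LINK_SIZE 2
#define GET(p, n) (((p)[n] << 8) | (p)[(n) + 1])

/* Toy opcodes; values are hypothetical. */
enum { T_END, T_CHAR, T_BRA, T_ALT, T_KET };

/* code points at an opening bracket. Each bracket or alternative stores
   the distance to the next alternative or to the closing ket. */
static const unsigned char *skip_group(const unsigned char *code)
{
  do code += GET(code, 1); while (*code == T_ALT);
  return code + 1 + LINK_SIZE;   /* step over the ket and its link */
}
```

For a toy encoding of `(a|b)` laid out as `BRA 0 5, CHAR a, ALT 0 5, CHAR b, KET 0 10, END`, the walk visits offsets 5 and 10 and returns a pointer to offset 13, the item after the group.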
@ -339,6 +346,7 @@ do
switch(tcode[1])
{
case OP_ANY:
case OP_ALLANY:
return SSB_FAIL;
case OP_NOT_DIGIT:

View File

@ -124,7 +124,8 @@ static const int eint[] = {
REG_BADPAT, /* (?+ or (?- must be followed by a non-zero number */
REG_BADPAT, /* number is too big */
REG_BADPAT, /* subpattern name expected */
REG_BADPAT /* digit expected after (?+ */
REG_BADPAT, /* digit expected after (?+ */
REG_BADPAT /* ] is an invalid data character in JavaScript compatibility mode */
};
/* Table of texts corresponding to POSIX error codes */
@ -261,7 +262,7 @@ PCREPOSIX_EXP_DEFN int
regexec(const regex_t *preg, const char *string, size_t nmatch,
regmatch_t pmatch[], int eflags)
{
int rc;
int rc, so, eo;
int options = 0;
int *ovector = NULL;
int small_ovector[POSIX_MALLOC_THRESHOLD * 3];
@ -294,7 +295,23 @@ else if (nmatch > 0)
}
}
rc = pcre_exec((const pcre *)preg->re_pcre, NULL, string, (int)strlen(string),
/* REG_STARTEND is a BSD extension, to allow for non-NUL-terminated strings.
The man page from OS X says "REG_STARTEND affects only the location of the
string, not how it is matched". That is why the "so" value is used to bump the
start location rather than being passed as a PCRE "starting offset". */
if ((eflags & REG_STARTEND) != 0)
{
so = pmatch[0].rm_so;
eo = pmatch[0].rm_eo;
}
else
{
so = 0;
eo = (int)strlen(string);
}
rc = pcre_exec((const pcre *)preg->re_pcre, NULL, string + so, (eo - so),
0, options, ovector, nmatch * 3);
if (rc == 0) rc = nmatch; /* All captured slots were filled in */
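
The REG_STARTEND handling above reduces to choosing a window into the subject: either the caller-supplied `[so, eo)` range or the whole NUL-terminated string. The sketch below isolates that choice so it can be read without the rest of regexec(). `subject_window` and `select_subject` are hypothetical names, and no regex library is called.

```c
#include <string.h>

/* Hypothetical stand-in for the subject window handed to the matcher. */
typedef struct { const char *start; int length; } subject_window;

/* use_startend models (eflags & REG_STARTEND) != 0; so and eo model
   pmatch[0].rm_so and pmatch[0].rm_eo, and are ignored otherwise. */
static subject_window
select_subject(const char *string, int use_startend, int so, int eo)
{
  subject_window w;
  if (!use_startend)
    {
    so = 0;
    eo = (int)strlen(string);
    }
  w.start = string + so;   /* bump the start, as the comment above explains */
  w.length = eo - so;
  return w;
}
```

Because the start pointer is bumped rather than passed as a starting offset, the matcher sees the window as a complete subject in its own right.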

View File

@ -59,6 +59,7 @@ extern "C" {
#define REG_DOTALL 0x0010 /* NOT defined by POSIX. */
#define REG_NOSUB 0x0020
#define REG_UTF8 0x0040 /* NOT defined by POSIX. */
#define REG_STARTEND 0x0080 /* BSD feature: pass subject string by so,eo */
/* This is not used by PCRE, but by defining it we make it easier
to slot PCRE into existing programs that make POSIX calls. */

View File

@ -2589,4 +2589,139 @@ a random value. /Ix
/[[:a\dz:]]/
/^(?<name>a|b\g<name>c)/
aaaa
bacxxx
bbaccxxx
bbbacccxx
/^(?<name>a|b\g'name'c)/
aaaa
bacxxx
bbaccxxx
bbbacccxx
/^(a|b\g<1>c)/
aaaa
bacxxx
bbaccxxx
bbbacccxx
/^(a|b\g'1'c)/
aaaa
bacxxx
bbaccxxx
bbbacccxx
/^(a|b\g'-1'c)/
aaaa
bacxxx
bbaccxxx
bbbacccxx
/(^(a|b\g<-1>c))/
aaaa
bacxxx
bbaccxxx
bbbacccxx
/(^(a|b\g<-1'c))/
/(^(a|b\g{-1}))/
bacxxx
/(?-i:\g<name>)(?i:(?<name>a))/
XaaX
XAAX
/(?i:\g<name>)(?-i:(?<name>a))/
XaaX
** Failers
XAAX
/(?-i:\g<+1>)(?i:(a))/
XaaX
XAAX
/(?=(?<regex>(?#simplesyntax)\$(?<name>[a-zA-Z_\x{7f}-\x{ff}][a-zA-Z0-9_\x{7f}-\x{ff}]*)(?:\[(?<index>[a-zA-Z0-9_\x{7f}-\x{ff}]+|\$\g<name>)\]|->\g<name>(\(.*?\))?)?|(?#simple syntax withbraces)\$\{(?:\g<name>(?<indices>\[(?:\g<index>|'(?:\\.|[^'\\])*'|"(?:\g<regex>|\\.|[^"\\])*")\])?|\g<complex>|\$\{\g<complex>\})\}|(?#complexsyntax)\{(?<complex>\$(?<segment>\g<name>(\g<indices>*|\(.*?\))?)(?:->\g<segment>)*|\$\g<complex>|\$\{\g<complex>\})\}))\{/
/(?<n>a|b|c)\g<n>*/
abc
accccbbb
/^(?+1)(?<a>x|y){0}z/
xzxx
yzyy
** Failers
xxz
/(\3)(\1)(a)/
cat
/(\3)(\1)(a)/<JS>
cat
/TA]/
The ACTA] comes
/TA]/<JS>
The ACTA] comes
/(?2)[]a()b](abc)/
abcbabc
/(?2)[^]a()b](abc)/
abcbabc
/(?1)[]a()b](abc)/
abcbabc
** Failers
abcXabc
/(?1)[^]a()b](abc)/
abcXabc
** Failers
abcbabc
/(?2)[]a()b](abc)(xyz)/
xyzbabcxyz
/(?&N)[]a(?<N>)](?<M>abc)/
abc<abc
/(?&N)[]a(?<N>)](abc)/
abc<abc
/a[]b/
/a[^]b/
/a[]b/<JS>
** Failers
ab
/a[]+b/<JS>
** Failers
ab
/a[]*+b/<JS>
** Failers
ab
/a[^]b/<JS>
aXb
a\nb
** Failers
ab
/a[^]+b/<JS>
aXb
a\nX\nXb
** Failers
ab
/a(?!)+b/
/a(*FAIL)+b/
/ End of testinput2 /
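
The first group of tests in this hunk, such as `/^(a|b\g<1>c)/`, uses the new `\g<...>` and `\g'...'` forms as subroutine calls, so the pattern matches `a` wrapped in equal numbers of leading `b` and trailing `c`. A hand-written recognizer for the same language can make the expected results easier to follow. This is only a sketch of the language being matched, not of how PCRE executes the pattern, and `match_nested` is a hypothetical name.

```c
#include <stddef.h>

/* Returns the length of the b^n a c^n prefix of s, or 0 if there is none.
   Mirrors the two alternatives: "a", or "b" <recursive call> "c". */
static size_t match_nested(const char *s)
{
  if (*s == 'a') return 1;
  if (*s == 'b')
    {
    size_t inner = match_nested(s + 1);
    if (inner != 0 && s[1 + inner] == 'c') return inner + 2;
    }
  return 0;
}
```

On the subjects used in the tests, this returns 1 for `aaaa`, 3 for `bacxxx`, 5 for `bbaccxxx`, and 7 for `bbbacccxx`, which agrees with the matched strings recorded in the corresponding test output.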

View File

@ -461,4 +461,16 @@ can't tell the difference.) --/
/[[:a\x{100}b:]]/8
/a[^]b/<JS>8
a\x{1234}b
a\nb
** Failers
ab
/a[^]+b/<JS>8
aXb
a\nX\nX\x{1234}b
** Failers
ab
/ End of testinput5 /

View File

@ -4364,5 +4364,32 @@
a\r\r\r\r\rb
a\x85\85b\<bsr_anycrlf>
a\x0b\0bb\<bsr_anycrlf>
/a(?!)|\wbc/
abc
/a[]b/<JS>
** Failers
ab
/a[]+b/<JS>
** Failers
ab
/a[]*+b/<JS>
** Failers
ab
/a[^]b/<JS>
aXb
a\nb
** Failers
ab
/a[^]+b/<JS>
aXb
a\nX\nXb
** Failers
ab
/ End of testinput7 /

View File

@ -21,7 +21,7 @@ Memory allocation (code space): 25
------------------------------------------------------------------
0 21 Bra
3 9 CBra 1
8 Any*
8 AllAny*
10 X
12 6 Alt
15 ^
@ -37,7 +37,7 @@ Memory allocation (code space): 29
0 25 Bra
3 9 Bra
6 04 Opt
8 Any*
8 AllAny*
10 X
12 8 Alt
15 04 Opt

View File

@ -1126,7 +1126,7 @@ Need char = 'X'
/.*X/IDZs
------------------------------------------------------------------
Bra
Any*
AllAny*
X
Ket
End
@ -1160,7 +1160,7 @@ No need char
------------------------------------------------------------------
Bra
CBra 1
Any*
AllAny*
X
Alt
^
@ -1179,7 +1179,7 @@ No need char
------------------------------------------------------------------
Bra
CBra 1
Any*
AllAny*
X
Alt
^
@ -1199,7 +1199,7 @@ No need char
Bra
Bra
04 Opt
Any*
AllAny*
X
Alt
04 Opt
@ -1212,8 +1212,8 @@ No need char
------------------------------------------------------------------
Capturing subpattern count = 0
Partial matching not supported
No options
First char at start or follows newline
Options: anchored
No first char
No need char
/\Biss\B/I+
@ -8074,13 +8074,13 @@ No match
Failed: reference to non-existent subpattern at offset 7
/^(a)\g/
Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 5
Failed: a numbered reference must not be zero at offset 5
/^(a)\g{0}/
Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 7
Failed: a numbered reference must not be zero at offset 8
/^(a)\g{3/
Failed: \g is not followed by a braced name or an optionally braced non-zero number at offset 8
Failed: \g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number at offset 8
/^(a)\g{4a}/
Failed: reference to non-existent subpattern at offset 9
@ -8217,13 +8217,13 @@ No match
No match
/x(?-0)y/
Failed: (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number at offset 5
Failed: a numbered reference must not be zero at offset 5
/x(?-1)y/
Failed: reference to non-existent subpattern at offset 5
/x(?+0)y/
Failed: (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number at offset 5
Failed: a numbered reference must not be zero at offset 5
/x(?+1)y/
Failed: reference to non-existent subpattern at offset 5
@ -9385,4 +9385,250 @@ Failed: unknown POSIX class name at offset 6
/[[:a\dz:]]/
Failed: unknown POSIX class name at offset 3
/^(?<name>a|b\g<name>c)/
aaaa
0: a
1: a
bacxxx
0: bac
1: bac
bbaccxxx
0: bbacc
1: bbacc
bbbacccxx
0: bbbaccc
1: bbbaccc
/^(?<name>a|b\g'name'c)/
aaaa
0: a
1: a
bacxxx
0: bac
1: bac
bbaccxxx
0: bbacc
1: bbacc
bbbacccxx
0: bbbaccc
1: bbbaccc
/^(a|b\g<1>c)/
aaaa
0: a
1: a
bacxxx
0: bac
1: bac
bbaccxxx
0: bbacc
1: bbacc
bbbacccxx
0: bbbaccc
1: bbbaccc
/^(a|b\g'1'c)/
aaaa
0: a
1: a
bacxxx
0: bac
1: bac
bbaccxxx
0: bbacc
1: bbacc
bbbacccxx
0: bbbaccc
1: bbbaccc
/^(a|b\g'-1'c)/
aaaa
0: a
1: a
bacxxx
0: bac
1: bac
bbaccxxx
0: bbacc
1: bbacc
bbbacccxx
0: bbbaccc
1: bbbaccc
/(^(a|b\g<-1>c))/
aaaa
0: a
1: a
2: a
bacxxx
0: bac
1: bac
2: bac
bbaccxxx
0: bbacc
1: bbacc
2: bbacc
bbbacccxx
0: bbbaccc
1: bbbaccc
2: bbbaccc
/(^(a|b\g<-1'c))/
Failed: \g is not followed by a braced, angle-bracketed, or quoted name/number or by a plain number at offset 15
/(^(a|b\g{-1}))/
bacxxx
No match
/(?-i:\g<name>)(?i:(?<name>a))/
XaaX
0: aa
1: a
XAAX
0: AA
1: A
/(?i:\g<name>)(?-i:(?<name>a))/
XaaX
0: aa
1: a
** Failers
No match
XAAX
No match
/(?-i:\g<+1>)(?i:(a))/
XaaX
0: aa
1: a
XAAX
0: AA
1: A
/(?=(?<regex>(?#simplesyntax)\$(?<name>[a-zA-Z_\x{7f}-\x{ff}][a-zA-Z0-9_\x{7f}-\x{ff}]*)(?:\[(?<index>[a-zA-Z0-9_\x{7f}-\x{ff}]+|\$\g<name>)\]|->\g<name>(\(.*?\))?)?|(?#simple syntax withbraces)\$\{(?:\g<name>(?<indices>\[(?:\g<index>|'(?:\\.|[^'\\])*'|"(?:\g<regex>|\\.|[^"\\])*")\])?|\g<complex>|\$\{\g<complex>\})\}|(?#complexsyntax)\{(?<complex>\$(?<segment>\g<name>(\g<indices>*|\(.*?\))?)(?:->\g<segment>)*|\$\g<complex>|\$\{\g<complex>\})\}))\{/
/(?<n>a|b|c)\g<n>*/
abc
0: abc
1: a
accccbbb
0: accccbbb
1: a
/^(?+1)(?<a>x|y){0}z/
xzxx
0: xz
1: <unset>
yzyy
0: yz
1: <unset>
** Failers
No match
xxz
No match
/(\3)(\1)(a)/
cat
No match
/(\3)(\1)(a)/<JS>
cat
0: a
1:
2:
3: a
/TA]/
The ACTA] comes
0: TA]
/TA]/<JS>
Failed: ] is an invalid data character in JavaScript compatibility mode at offset 2
/(?2)[]a()b](abc)/
Failed: reference to non-existent subpattern at offset 3
/(?2)[^]a()b](abc)/
Failed: reference to non-existent subpattern at offset 3
/(?1)[]a()b](abc)/
abcbabc
0: abcbabc
1: abc
** Failers
No match
abcXabc
No match
/(?1)[^]a()b](abc)/
abcXabc
0: abcXabc
1: abc
** Failers
No match
abcbabc
No match
/(?2)[]a()b](abc)(xyz)/
xyzbabcxyz
0: xyzbabcxyz
1: abc
2: xyz
/(?&N)[]a(?<N>)](?<M>abc)/
Failed: reference to non-existent subpattern at offset 4
/(?&N)[]a(?<N>)](abc)/
Failed: reference to non-existent subpattern at offset 4
/a[]b/
Failed: missing terminating ] for character class at offset 4
/a[^]b/
Failed: missing terminating ] for character class at offset 5
/a[]b/<JS>
** Failers
No match
ab
No match
/a[]+b/<JS>
** Failers
No match
ab
No match
/a[]*+b/<JS>
** Failers
No match
ab
No match
/a[^]b/<JS>
aXb
0: aXb
a\nb
0: a\x0ab
** Failers
No match
ab
No match
/a[^]+b/<JS>
aXb
0: aXb
a\nX\nXb
0: a\x0aX\x0aXb
** Failers
No match
ab
No match
/a(?!)+b/
Failed: nothing to repeat at offset 5
/a(*FAIL)+b/
Failed: nothing to repeat at offset 8
/ End of testinput2 /

View File

@ -1608,4 +1608,24 @@ No match
/[[:a\x{100}b:]]/8
Failed: unknown POSIX class name at offset 3
/a[^]b/<JS>8
a\x{1234}b
0: a\x{1234}b
a\nb
0: a\x{0a}b
** Failers
No match
ab
No match
/a[^]+b/<JS>8
aXb
0: aXb
a\nX\nX\x{1234}b
0: a\x{0a}X\x{0a}X\x{1234}b
** Failers
No match
ab
No match
/ End of testinput5 /

View File

@ -7211,5 +7211,47 @@ No match
No match
a\x0b\0bb\<bsr_anycrlf>
No match
/a(?!)|\wbc/
abc
0: abc
/a[]b/<JS>
** Failers
No match
ab
No match
/a[]+b/<JS>
** Failers
No match
ab
No match
/a[]*+b/<JS>
** Failers
No match
ab
No match
/a[^]b/<JS>
aXb
0: aXb
a\nb
0: a\x0ab
** Failers
No match
ab
No match
/a[^]+b/<JS>
aXb
0: aXb
a\nX\nXb
0: a\x0aX\x0aXb
** Failers
No match
ab
No match
/ End of testinput7 /

View File

@ -17,7 +17,7 @@ typedef struct cnode {
#define f0_scriptmask 0xff000000 /* Mask for script field */
#define f0_scriptshift 24 /* Shift for script value */
#define f0_rangeflag 0x00f00000 /* Flag for a range item */
#define f0_rangeflag 0x00800000 /* Flag for a range item */
#define f0_charmask 0x001fffff /* Mask for code point value */
/* Things for the f1 field */
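
The f0_rangeflag change can be checked with plain bit arithmetic: the old value 0x00f00000 shares bit 20 with f0_charmask (0x001fffff), while the new value 0x00800000 does not. The sketch below demonstrates only that arithmetic; the upper-case macro names are local to the example, and the practical effect on table lookups is an inference rather than something stated in this diff.

```c
#define F0_CHARMASK      0x001fffffu   /* same value as f0_charmask */
#define F0_RANGEFLAG_OLD 0x00f00000u   /* value before this change */
#define F0_RANGEFLAG_NEW 0x00800000u   /* value after this change */

/* Nonzero when the flag shares bits with the code point mask. */
static unsigned int overlap(unsigned int flag)
{
  return flag & F0_CHARMASK;
}
```

With the old flag, a code point at or above 0x100000 stored in the same word would have tested as flagged; with the new flag the two fields are disjoint.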