mirror of
https://github.com/php/php-src.git
synced 2024-10-19 15:34:25 +00:00
413 lines
12 KiB
Plaintext
413 lines
12 KiB
Plaintext
Oniguruma Regular Expressions Version 4.3.0 2006/08/17
|
|
|
|
syntax: ONIG_SYNTAX_RUBY (default)
|
|
|
|
|
|
1. Syntax elements
|
|
|
|
\ escape (enable or disable meta character meaning)
|
|
| alternation
|
|
(...) group
|
|
[...] character class
|
|
|
|
|
|
2. Characters
|
|
|
|
\t horizontal tab (0x09)
|
|
\v vertical tab (0x0B)
|
|
\n newline (0x0A)
|
|
\r return (0x0D)
|
|
\b back space (0x08)
|
|
\f form feed (0x0C)
|
|
\a bell (0x07)
|
|
\e escape (0x1B)
|
|
\nnn octal char (encoded byte value)
|
|
\xHH hexadecimal char (encoded byte value)
|
|
\x{7HHHHHHH} wide hexadecimal char (character code point value)
|
|
\cx control char (character code point value)
|
|
\C-x control char (character code point value)
|
|
\M-x meta (x|0x80) (character code point value)
|
|
\M-\C-x meta control char (character code point value)
|
|
|
|
(* \b is effective in character class [...] only)
|
|
|
|
|
|
3. Character types
|
|
|
|
. any character (except newline)
|
|
|
|
\w word character
|
|
|
|
Not Unicode:
|
|
alphanumeric, "_" and multibyte char.
|
|
|
|
Unicode:
|
|
General_Category -- (Letter|Mark|Number|Connector_Punctuation)
|
|
|
|
\W non word char
|
|
|
|
\s whitespace char
|
|
|
|
Not Unicode:
|
|
\t, \n, \v, \f, \r, \x20
|
|
|
|
Unicode:
|
|
0009, 000A, 000B, 000C, 000D, 0085(NEL),
|
|
General_Category -- Line_Separator
|
|
-- Paragraph_Separator
|
|
-- Space_Separator
|
|
|
|
\S non whitespace char
|
|
|
|
\d decimal digit char
|
|
|
|
Unicode: General_Category -- Decimal_Number
|
|
|
|
\D non decimal digit char
|
|
|
|
\h hexadecimal digit char [0-9a-fA-F]
|
|
|
|
\H non hexadecimal digit char
|
|
|
|
|
|
4. Quantifier
|
|
|
|
greedy
|
|
|
|
? 1 or 0 times
|
|
* 0 or more times
|
|
+ 1 or more times
|
|
{n,m} at least n but not more than m times
|
|
{n,} at least n times
|
|
{,n} at least 0 but not more than n times ({0,n})
|
|
{n} n times
|
|
|
|
reluctant
|
|
|
|
?? 1 or 0 times
|
|
*? 0 or more times
|
|
+? 1 or more times
|
|
{n,m}? at least n but not more than m times
|
|
{n,}? at least n times
|
|
{,n}? at least 0 but not more than n times (== {0,n}?)
|
|
|
|
possessive (greedy and does not backtrack after repeated)
|
|
|
|
?+ 1 or 0 times
|
|
*+ 0 or more times
|
|
++ 1 or more times
|
|
|
|
({n,m}+, {n,}+, {n}+ are possessive op. in ONIG_SYNTAX_JAVA only)
|
|
|
|
ex. /a*+/ === /(?>a*)/
|
|
|
|
|
|
5. Anchors
|
|
|
|
^ beginning of the line
|
|
$ end of the line
|
|
\b word boundary
|
|
\B not word boundary
|
|
\A beginning of string
|
|
\Z end of string, or before newline at the end
|
|
\z end of string
|
|
\G matching start position (*)
|
|
|
|
* Ruby Regexp:
|
|
previous end-of-match position
|
|
(This specification is not related to this library.)
|
|
|
|
|
|
6. Character class
|
|
|
|
^... negative class (lowest precedence operator)
|
|
x-y range from x to y
|
|
[...] set (character class in character class)
|
|
..&&.. intersection (low precedence at the next of ^)
|
|
|
|
ex. [a-w&&[^c-g]z] ==> ([a-w] AND ([^c-g] OR z)) ==> [abh-w]
|
|
|
|
* If you want to use '[', '-', ']' as a normal character
|
|
in a character class, you should escape these characters by '\'.
|
|
|
|
|
|
POSIX bracket ([:xxxxx:], negate [:^xxxxx:])
|
|
|
|
Not Unicode Case:
|
|
|
|
alnum alphabet or digit char
|
|
alpha alphabet
|
|
ascii code value: [0 - 127]
|
|
blank \t, \x20
|
|
cntrl
|
|
digit 0-9
|
|
graph include all of multibyte encoded characters
|
|
lower
|
|
print include all of multibyte encoded characters
|
|
punct
|
|
space \t, \n, \v, \f, \r, \x20
|
|
upper
|
|
xdigit 0-9, a-f, A-F
|
|
|
|
|
|
Unicode Case:
|
|
|
|
alnum Letter | Mark | Decimal_Number
|
|
alpha Letter | Mark
|
|
ascii 0000 - 007F
|
|
blank Space_Separator | 0009
|
|
cntrl Control | Format | Unassigned | Private_Use | Surrogate
|
|
digit Decimal_Number
|
|
graph [[:^space:]] && ^Control && ^Unassigned && ^Surrogate
|
|
lower Lowercase_Letter
|
|
print [[:graph:]] | [[:space:]]
|
|
punct Connector_Punctuation | Dash_Punctuation | Close_Punctuation |
|
|
Final_Punctuation | Initial_Punctuation | Other_Punctuation |
|
|
Open_Punctuation
|
|
space Space_Separator | Line_Separator | Paragraph_Separator |
|
|
0009 | 000A | 000B | 000C | 000D | 0085
|
|
upper Uppercase_Letter
|
|
xdigit 0030 - 0039 | 0041 - 0046 | 0061 - 0066
|
|
(0-9, a-f, A-F)
|
|
|
|
|
|
7. Extended groups
|
|
|
|
(?#...) comment
|
|
|
|
(?imx-imx) option on/off
|
|
i: ignore case
|
|
m: multi-line (dot(.) match newline)
|
|
x: extended form
|
|
(?imx-imx:subexp) option on/off for subexp
|
|
|
|
(?:subexp) not captured group
|
|
(subexp) captured group
|
|
|
|
(?=subexp) look-ahead
|
|
(?!subexp) negative look-ahead
|
|
(?<=subexp) look-behind
|
|
(?<!subexp) negative look-behind
|
|
|
|
Subexp of look-behind must be fixed character length.
|
|
But different character length is allowed in top level
|
|
alternatives only.
|
|
ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.
|
|
|
|
In negative-look-behind, captured group isn't allowed,
|
|
but shy group(?:) is allowed.
|
|
|
|
(?>subexp) atomic group
|
|
don't backtrack in subexp.
|
|
|
|
(?<name>subexp) define named group
|
|
(All characters of the name must be a word character.
|
|
And first character must not be a digit or uppper case)
|
|
|
|
Not only a name but a number is assigned like a captured
|
|
group.
|
|
|
|
Assigning the same name as two or more subexps is allowed.
|
|
In this case, a subexp call can not be performed although
|
|
the back reference is possible.
|
|
|
|
|
|
8. Back reference
|
|
|
|
\n back reference by group number (n >= 1)
|
|
\k<name> back reference by group name
|
|
|
|
In the back reference by the multiplex definition name,
|
|
a subexp with a large number is referred to preferentially.
|
|
(When not matched, a group of the small number is referred to.)
|
|
|
|
* Back reference by group number is forbidden if named group is defined
|
|
in the pattern and ONIG_OPTION_CAPTURE_GROUP is not setted.
|
|
|
|
|
|
back reference with nest level
|
|
|
|
(This function is disabled in Ruby 1.9.)
|
|
|
|
\k<name+n> n: 0, 1, 2, ...
|
|
\k<name-n> n: 0, 1, 2, ...
|
|
|
|
Destinate relative nest level from back reference position.
|
|
|
|
ex 1.
|
|
|
|
/\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/.match("reer")
|
|
|
|
ex 2.
|
|
|
|
r = Regexp.compile(<<'__REGEXP__'.strip, Regexp::EXTENDED)
|
|
(?<element> \g<stag> \g<content>* \g<etag> ){0}
|
|
(?<stag> < \g<name> \s* > ){0}
|
|
(?<name> [a-zA-Z_:]+ ){0}
|
|
(?<content> [^<&]+ (\g<element> | [^<&]+)* ){0}
|
|
(?<etag> </ \k<name+1> >){0}
|
|
\g<element>
|
|
__REGEXP__
|
|
|
|
p r.match('<foo>f<bar>bbb</bar>f</foo>').captures
|
|
|
|
|
|
|
|
9. Subexp call ("Tanaka Akira special")
|
|
|
|
\g<name> call by group name
|
|
\g<n> call by group number (n >= 1)
|
|
|
|
* left-most recursive call is not allowed.
|
|
ex. (?<name>a|\g<name>b) => error
|
|
(?<name>a|b\g<name>c) => OK
|
|
|
|
* Call by group number is forbidden if named group is defined in the pattern
|
|
and ONIG_OPTION_CAPTURE_GROUP is not setted.
|
|
|
|
* If the option status of called group is different from calling position
|
|
then the group's option is effective.
|
|
|
|
ex. (?-i:\g<name>)(?i:(?<name>a)){0} match to "A"
|
|
|
|
|
|
10. Captured group
|
|
|
|
Behavior of the no-named group (...) changes with the following conditions.
|
|
(But named group is not changed.)
|
|
|
|
case 1. /.../ (named group is not used, no option)
|
|
|
|
(...) is treated as a captured group.
|
|
|
|
case 2. /.../g (named group is not used, 'g' option)
|
|
|
|
(...) is treated as a no-captured group (?:...).
|
|
|
|
case 3. /..(?<name>..)../ (named group is used, no option)
|
|
|
|
(...) is treated as a no-captured group (?:...).
|
|
numbered-backref/call is not allowed.
|
|
|
|
case 4. /..(?<name>..)../G (named group is used, 'G' option)
|
|
|
|
(...) is treated as a captured group.
|
|
numbered-backref/call is allowed.
|
|
|
|
where
|
|
g: ONIG_OPTION_DONT_CAPTURE_GROUP
|
|
G: ONIG_OPTION_CAPTURE_GROUP
|
|
|
|
('g' and 'G' options are argued in ruby-dev ML)
|
|
|
|
These options are not implemented in Ruby level.
|
|
|
|
|
|
-----------------------------
|
|
A-1. Syntax depend options
|
|
|
|
+ ONIG_SYNTAX_RUBY
|
|
(?m): dot(.) match newline
|
|
|
|
+ ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA
|
|
(?s): dot(.) match newline
|
|
(?m): ^ match after newline, $ match before newline
|
|
|
|
|
|
A-2. Original extensions
|
|
|
|
+ hexadecimal digit char type \h, \H
|
|
+ named group (?<name>...)
|
|
+ named backref \k<name>
|
|
+ subexp call \g<name>, \g<group-num>
|
|
|
|
|
|
A-3. Lacked features compare with perl 5.8.0
|
|
|
|
+ [:word:]
|
|
+ \N{name}
|
|
+ \l,\u,\L,\U, \X, \C
|
|
+ (?{code})
|
|
+ (??{code})
|
|
+ (?(condition)yes-pat|no-pat)
|
|
|
|
* \Q...\E
|
|
This is effective on ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA.
|
|
|
|
* \p{property}, \P{property}
|
|
This is effective on ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA.
|
|
Alnum, Alpha, Blank, Cntrl, Digit, Graph, Lower,
|
|
Print, Punct, Space, Upper, XDigit, ASCII are supported.
|
|
|
|
Prefix 'Is' of property name is allowed in ONIG_SYNTAX_PERL only.
|
|
ex. \p{IsXDigit}.
|
|
|
|
Negation operator of property is supported in ONIG_SYNTAX_PERL only.
|
|
\p{^...}, \P{^...}
|
|
|
|
|
|
A-4. Differences with Japanized GNU regex(version 0.12) of Ruby
|
|
|
|
+ add hexadecimal digit char type (\h, \H)
|
|
+ add look-behind
|
|
(?<=fixed-char-length-pattern), (?<!fixed-char-length-pattern)
|
|
+ add possessive quantifier. ?+, *+, ++
|
|
+ add operations in character class. [], &&
|
|
('[' must be escaped as an usual char in character class.)
|
|
+ add named group and subexp call.
|
|
+ octal or hexadecimal number sequence can be treated as
|
|
a multibyte code char in character class if multibyte encoding
|
|
is specified.
|
|
(ex. [\xa1\xa2], [\xa1\xa7-\xa4\xa1])
|
|
+ allow the range of single byte char and multibyte char in character
|
|
class.
|
|
ex. /[a-<<any EUC-JP character>>]/ in EUC-JP encoding.
|
|
+ effect range of isolated option is to next ')'.
|
|
ex. (?:(?i)a|b) is interpreted as (?:(?i:a|b)), not (?:(?i:a)|b).
|
|
+ isolated option is not transparent to previous pattern.
|
|
ex. a(?i)* is a syntax error pattern.
|
|
+ allowed incompleted left brace as an usual string.
|
|
ex. /{/, /({)/, /a{2,3/ etc...
|
|
+ negative POSIX bracket [:^xxxx:] is supported.
|
|
+ POSIX bracket [:ascii:] is added.
|
|
+ repeat of look-ahead is not allowed.
|
|
ex. /(?=a)*/, /(?!b){5}/
|
|
+ Ignore case option is effective to numbered character.
|
|
ex. /\x61/i =~ "A"
|
|
+ In the range quantifier, the number of the minimum is omissible.
|
|
/a{,n}/ == /a{0,n}/
|
|
The simultanious abbreviation of the number of times of the minimum
|
|
and the maximum is not allowed. (/a{,}/)
|
|
+ /a{n}?/ is not a non-greedy operator.
|
|
/a{n}?/ == /(?:a{n})?/
|
|
+ invalid back reference is checked and cause error.
|
|
/\1/, /(a)\2/
|
|
+ Zero-length match in infinite repeat stops the repeat,
|
|
then changes of the capture group status are checked as stop condition.
|
|
/(?:()|())*\1\2/ =~ ""
|
|
/(?:\1a|())*/ =~ "a"
|
|
|
|
|
|
A-5. Disabled functions by default syntax
|
|
|
|
+ capture history
|
|
|
|
(?@...) and (?@<name>...)
|
|
|
|
ex. /(?@a)*/.match("aaa") ==> [<0-1>, <1-2>, <2-3>]
|
|
|
|
see sample/listcap.c file.
|
|
|
|
|
|
A-6. Problems
|
|
|
|
+ Invalid encoding byte sequence is not checked in UTF-8.
|
|
|
|
* Invalid first byte is treated as a character.
|
|
/./u =~ "\xa3"
|
|
|
|
* Incomplete byte sequence is not checked.
|
|
/\w+/ =~ "a\xf3\x8ec"
|
|
|
|
// END
|