php-src/ext/mbstring/oniguruma/doc/RE

225 lines
5.9 KiB
Plaintext
Raw Normal View History

Oniguruma Regular Expressions 2003/07/04
syntax: REG_SYNTAX_RUBY (default)
1. Syntax elements
\ escape
| alternation
(...) group
[...] character class
2. Characters
\t horizontal tab (0x09)
\v vertical tab (0x0B)
\n newline (0x0A)
\r return (0x0D)
\b back space (0x08) (* in character class only)
\f form feed (0x0C)
\a bell (0x07)
\e escape (0x1B)
\nnn octal char
\xHH hexadecimal char
\x{7HHHHHHH} wide hexadecimal char
\cx control char
\C-x control char
\M-x meta (x|0x80)
\M-\C-x meta control char
3. Character types
. any character (except newline)
\w word character (alphanumeric, "_" and multibyte char)
\W non-word char
\s whitespace char (\t, \n, \v, \f, \r, \x20)
\S non-whitespace char
\d digit char
\D non-digit char
4. Quantifier
greedy
? 1 or 0 times
* 0 or more times
+ 1 or more times
{n,m} at least n but not more than m times
{n,} at least n times
{n} n times
reluctant
?? 1 or 0 times
*? 0 or more times
+? 1 or more times
{n,m}? at least n but not more than m times
{n,}? at least n times
possessive (greedy and does not backtrack after repeated)
?+ 1 or 0 times
*+ 0 or more times
++ 1 or more times
5. Anchors
^ beginning of the line
$ end of the line
\b word boundary
\B not word boundary
\A beginning of string
\Z end of string, or before newline at the end
\z end of string
\G previous end-of-match position
6. POSIX character class ([:xxxxx:], negate [:^xxxxx:])
alnum alphabet or digit char
alpha alphabet
ascii code value: [0 - 127]
blank \t, \x20
cntrl
digit 0-9
graph
lower
print
punct
space \t, \n, \v, \f, \r, \x20
upper
xdigit 0-9, a-f, A-F
7. Operators in character class
[...] group (character class in character class)
&& intersection
(lowest precedence operator in character class)
ex. [a-w&&[^c-g]z] ==> ([a-w] and ([^c-g] or z)) ==> [abh-w]
8. Extended expressions
(?#...) comment
(?imx-imx) option on/off
i: ignore case
m: multi-line (dot(.) match newline)
x: extended form
(?imx-imx:subexp) option on/off for subexp
(?:subexp) not captured
(?=subexp) look-ahead
(?!subexp) negative look-ahead
(?<=subexp) look-behind
(?<!subexp) negative look-behind
Subexp of look-behind must be fixed character length.
But different character length is allowed in top level
alternatives only.
ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.
(?>subexp) don't backtrack
(?<name>subexp) define named group
(name can not include '>', ')', '\' and NUL character)
9. Back reference
\n back reference by group number (n >= 1)
\k<name> back reference by group name
10. Subexp call ("Tanaka Akira special")
\g<name> call by group name
\g<n> call by group number (only if 'n' is not defined as name)
-----------------------------
11. Original extensions
+ named group (?<name>...)
+ named backref \k<name>
+ subexp call \g<name>, \g<group-num>
12. Lacked features compare with perl 5.8.0
+ [:word:]
+ \N{name}
+ \l,\u,\L,\U, \P, \X, \C
+ (?{code})
+ (??{code})
+ (?(condition)yes-pat|no-pat)
+ \Q...\E (* This is effective on REG_SYNTAX_PERL and REG_SYNTAX_JAVA)
13. Syntax depend options
+ REG_SYNTAX_RUBY (default)
(?m): dot(.) match newline
+ REG_SYNTAX_PERL, REG_SYNTAX_JAVA
(?s): dot(.) match newline
(?m): ^ match after newline, $ match before newline
14. Differences with Japanized GNU regex(version 0.12) of Ruby
+ add look behind
(?<=fixed-char-length-pattern), (?<!fixed-char-length-pattern)
(in negative-look-behind, capture group isn't allowed,
shy group(?:) is allowed.)
+ add possessive quantifier. ?+, *+, ++
+ add operations in character class. [], &&
+ add named group and subexp call.
+ octal or hexadecimal number sequence can be treated as
a multibyte code char in char-class, if multibyte encoding is specified.
(ex. [\xa1\xa2], [\xa1\xa7-\xa4\xa1])
+ effect range of isolated option is to next ')'.
ex. (?:(?i)a|b) is interpreted as (?:(?i:a|b)), not (?:(?i:a)|b).
+ isolated option is not transparent to previous pattern.
ex. a(?i)* is a syntax error pattern.
+ allowed incompleted left brace as an usual char.
ex. /{/, /({)/, /a{2,3/ etc...
+ negative POSIX bracket [:^xxxx:] is supported.
+ POSIX bracket [:ascii:] is added.
+ repeat of look-ahead is not allowd.
ex. /(?=a)*/, /(?!b){5}/
14. Problems
+ Invalid first byte in UTF-8 is allowed.
(which is the same as GNU regex of Ruby)
/./u =~ "\xa3"
Of course, although it is possible to validate,
it will become later than now.
+ Zero-length match in infinite repeat stops the repeat,
and captured group status isn't checked as stop condition.
/()*\1/ =~ "" #=> match
/(?:()|())*\1\2/ =~ "" #=> fail
/(?:\1a|())*/ =~ "a" #=> match with ""
+ Ignore case option is not effect to an octal or hexadecimal
numbered char, but it becomes effective if it appears in the char class.
This doesn't have consistency, though they are the specifications
which are the same as GNU regex of Ruby.
/\x61/i.match("A") # => nil
/[\x61]/i.match("A") # => match
// END