php-src/ext/mbstring/oniguruma/doc/RE

Oniguruma Regular Expressions     2003/07/04

syntax: REG_SYNTAX_RUBY (default)


1. Syntax elements

  \       escape
  |       alternation
  (...)   group
  [...]   character class


2. Characters

  \t           horizontal tab (0x09)
  \v           vertical tab   (0x0B)
  \n           newline        (0x0A)
  \r           return         (0x0D)
  \b           backspace      (0x08) (* in character class only)
  \f           form feed      (0x0C)
  \a           bell           (0x07)
  \e           escape         (0x1B)
  \nnn         octal char
  \xHH         hexadecimal char
  \x{7HHHHHHH} wide hexadecimal char
  \cx          control char
  \C-x         control char
  \M-x         meta  (x|0x80)
  \M-\C-x      meta control char
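
  A minimal sketch of a few of these escapes. It assumes Ruby as the host
  language, since Ruby's regexp engine uses this syntax; the constant names
  are illustrative and not part of Oniguruma.

```ruby
# Each pattern spells a character with an escape from the table above.
TAB_HEX   = /a\x09b/   # \xHH: hexadecimal char (0x09 is horizontal tab)
NL_OCTAL  = /a\012b/   # \nnn: octal char (012 is newline, 0x0A)
BACKSPACE = /a[\b]b/   # \b means backspace only inside a character class
CONTROL   = /\cA/      # \cx: control char (control-A is 0x01)

p(TAB_HEX   =~ "a\tb")   # => 0
p(NL_OCTAL  =~ "a\nb")   # => 0
p(BACKSPACE =~ "a\bb")   # => 0  ("\b" in a double-quoted string is 0x08)
p(CONTROL   =~ "\x01")   # => 0
```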


3. Character types

  .        any character (except newline)
  \w       word character (alphanumeric, "_" and multibyte char)
  \W       non-word char
  \s       whitespace char (\t, \n, \v, \f, \r, \x20)
  \S       non-whitespace char
  \d       digit char
  \D       non-digit char
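
  The character types can be tried out as follows. This is a sketch in Ruby
  (assumed here as the host language) using ASCII input only; the constant
  names are illustrative.

```ruby
WORDS  = /\w+/    # runs of word characters (alphanumeric and "_")
DIGITS = /\d+/    # runs of digits
NONSP  = /\S+/    # runs of non-whitespace
DOT    = /a.b/    # "." is any character except newline

p "foo_1 bar-2".scan(WORDS)          # => ["foo_1", "bar", "2"]
p "10 apples, 7 pears".scan(DIGITS)  # => ["10", "7"]
p "a \t b\n".scan(NONSP)             # => ["a", "b"]
p(DOT =~ "a-b")                      # => 0
p(DOT =~ "a\nb")                     # => nil
```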


4. Quantifier

  greedy

  ?       1 or 0 times
  *       0 or more times
  +       1 or more times
  {n,m}   at least n but not more than m times
  {n,}    at least n times
  {n}     n times

  reluctant

  ??      1 or 0 times
  *?      0 or more times
  +?      1 or more times
  {n,m}?  at least n but not more than m times
  {n,}?   at least n times

  possessive (greedy, and does not backtrack into the repeat once it has matched)

  ?+      1 or 0 times
  *+      0 or more times
  ++      1 or more times
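
  The difference between the three kinds of quantifier shows up when the rest
  of the pattern needs a character that the repeat has already consumed. A
  small sketch, assuming Ruby as the host language (constant names are
  illustrative):

```ruby
GREEDY     = /a+a/        # a+ takes "aaa", then gives one "a" back
RELUCTANT  = /a+?/        # a+? stops after the minimum, one "a"
POSSESSIVE = /a++a/       # a++ keeps all of "aaa"; nothing is left for the final "a"
BOUNDED    = /\d{2,3}/    # at least 2 but not more than 3 digits, greedy
BOUNDED_R  = /\d{2,3}?/   # the same range, reluctant

p "aaa"[GREEDY]          # => "aaa"
p "aaa"[RELUCTANT]       # => "a"
p(POSSESSIVE =~ "aaa")   # => nil
p "12345"[BOUNDED]       # => "123"
p "12345"[BOUNDED_R]     # => "12"
```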


5. Anchors

  ^       beginning of the line
  $       end of the line
  \b      word boundary
  \B      not word boundary
  \A      beginning of string
  \Z      end of string, or before newline at the end
  \z      end of string
  \G      previous end-of-match position
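
  The line anchors and the string anchors differ only when the subject
  contains newlines. The sketch below assumes Ruby as the host language; TEXT
  and PHRASE are illustrative names.

```ruby
TEXT   = "foo\nbar\n"
PHRASE = "cat concat"

p TEXT.scan(/^\w+$/)         # => ["foo", "bar"]  (^ and $ work per line)
p(/\Afoo/ =~ TEXT)           # => 0
p(/\Abar/ =~ TEXT)           # => nil  (\A is the beginning of the string only)
p(/bar\Z/ =~ TEXT)           # => 4    (\Z also matches before a final newline)
p(/bar\z/ =~ TEXT)           # => nil  (\z is the very end of the string)
p PHRASE.scan(/\bcat\b/)     # => ["cat"]
p(/\Bcat/ =~ PHRASE)         # => 7    (the "cat" inside "concat")
```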


6. POSIX character class  ([:xxxxx:], negate [:^xxxxx:])

  alnum    alphabetic or digit char
  alpha    alphabetic char
  ascii    code value: [0 - 127]
  blank    \t, \x20
  cntrl
  digit    0-9
  graph
  lower
  print
  punct
  space    \t, \n, \v, \f, \r, \x20
  upper
  xdigit   0-9, a-f, A-F
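
  A POSIX bracket is written inside a character class, so the complete form
  is [[:alpha:]]. A short sketch with ASCII input, assuming Ruby as the host
  language (SAMPLE is an illustrative name):

```ruby
SAMPLE = "Ab1 ,"

p SAMPLE.scan(/[[:alpha:]]/)      # => ["A", "b"]
p SAMPLE.scan(/[[:^alpha:]]/)     # => ["1", " ", ","]  (negated form)
p "x9F-z".scan(/[[:xdigit:]]+/)   # => ["9F"]
p "a\tb c".scan(/[[:blank:]]/)    # => ["\t", " "]
```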


7. Operators in character class

  [...]   group (character class in character class)
  &&      intersection
         (lowest precedence operator in character class)

  ex. [a-w&&[^c-g]z] ==> ([a-w] and ([^c-g] or z)) ==> [abh-w]
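
  The example can be checked by listing every lowercase letter the class
  accepts. A sketch assuming Ruby as the host language; INTERSECT and
  EXPANDED are illustrative names.

```ruby
INTERSECT = /[a-w&&[^c-g]z]/   # [a-w] and ([^c-g] or z)
EXPANDED  = /[abh-w]/          # the equivalent class written out by hand

accepted = ("a".."z").select { |c| INTERSECT =~ c }.join
expected = ("a".."z").select { |c| EXPANDED =~ c }.join

p accepted               # => "abhijklmnopqrstuvw"
p accepted == expected   # => true
```

  "z" is rejected even though it is listed, because && binds last and "z"
  lies outside [a-w].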


8. Extended expressions

  (?#...)              comment
  (?imx-imx)           option on/off
                         i: ignore case
                         m: multi-line (dot (.) matches newline)
                         x: extended form
  (?imx-imx:subexp)    option on/off for subexp
  (?:subexp)           not captured
  (?=subexp)           look-ahead
  (?!subexp)           negative look-ahead
  (?<=subexp)          look-behind
  (?<!subexp)          negative look-behind

                       The subexp of a look-behind must have a fixed character
                       length. Alternatives of different lengths are allowed
                       only at the top level of the look-behind.
                       ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.

  (?>subexp)           don't backtrack
  (?<name>subexp)      define named group
                       (name cannot include '>', ')', '\' or the NUL character)
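
  The following sketch exercises several of these constructs, including the
  look-behind length restriction. It assumes Ruby as the host language; the
  constant names and the compiles? helper are hypothetical and exist only
  for this illustration.

```ruby
LOOKAHEAD  = /foo(?=bar)/      # "foo" only when "bar" follows
NEG_AHEAD  = /foo(?!bar)/      # "foo" only when "bar" does not follow
LOOKBEHIND = /(?<=\$)\d+/      # digits preceded by "$"
ATOMIC     = /(?>a+)a/         # the group never gives back an "a"
INLINE_I   = /(?i:foo)bar/     # ignore case for "foo" only
NAMED      = /(?<year>\d{4})-(?<month>\d{2})/

# Hypothetical helper: true if the pattern source compiles.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

p "foobar"[LOOKAHEAD]                 # => "foo"
p(NEG_AHEAD =~ "foobar")              # => nil
p 'price: $42'[LOOKBEHIND]            # => "42"
p(ATOMIC =~ "aaa")                    # => nil
p(INLINE_I =~ "FOObar")               # => 0
p(INLINE_I =~ "FOOBAR")               # => nil
p NAMED.match("2003-07")[:year]       # => "2003"
p compiles?("(?<=a|bc)")              # => true
p compiles?("(?<=aaa(?:b|cd))")       # => false
```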


9. Back reference

  \n          back reference by group number (n >= 1)
  \k<name>    back reference by group name
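
  Both forms match the same text that the group matched earlier. A sketch
  that finds a doubled word, assuming Ruby as the host language (constant
  names are illustrative):

```ruby
BY_NUMBER = /(\w+) \1/                 # group 1, then the same text again
BY_NAME   = /(?<word>\w+) \k<word>/    # the same idea with a named group

p "hello hello world"[BY_NUMBER]       # => "hello hello"
p(BY_NUMBER =~ "one two")              # => nil
p BY_NAME.match("the the cat")[:word]  # => "the"
```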


10. Subexp call ("Tanaka Akira special")

  \g<name>    call by group name
  \g<n>       call by group number (only if 'n' is not defined as name)
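
  A subexp call runs the group's pattern again at the current position, and
  a group may call itself. The sketch below, which assumes Ruby as the host
  language and uses illustrative names, matches balanced parentheses by
  recursion and reuses a numbered group.

```ruby
# <paren> matches "(", then any mix of non-parenthesis chars and nested
# <paren> calls, then ")".
BALANCED = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/

# \g<1> re-runs the pattern of group 1 (two digits), not its matched text.
PAIR = /(\d\d)-\g<1>/

p(BALANCED =~ "(a(b)c)")   # => 0
p(BALANCED =~ "(a(b)c")    # => nil
p "12-34"[PAIR]            # => "12-34"
p(PAIR =~ "12-ab")         # => nil
```

  Unlike the back reference \1, the call \g<1> accepts "34" after "12",
  because it repeats the pattern rather than the captured text.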


-----------------------------
11. Original extensions

   + named group     (?<name>...)
   + named backref   \k<name>
   + subexp call     \g<name>, \g<group-num>


12. Features lacking in comparison with Perl 5.8.0

   + [:word:]
   + \N{name}
   + \l, \u, \L, \U, \P, \X, \C
   + (?{code})
   + (??{code})
   + (?(condition)yes-pat|no-pat)

   + \Q...\E   (* supported with REG_SYNTAX_PERL and REG_SYNTAX_JAVA)


13. Syntax-dependent options

   + REG_SYNTAX_RUBY (default)
     (?m): dot (.) matches newline

   + REG_SYNTAX_PERL, REG_SYNTAX_JAVA
     (?s): dot (.) matches newline
     (?m): ^ matches after newline, $ matches before newline
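
  The REG_SYNTAX_RUBY behaviour can be seen from Ruby itself, which is
  assumed here as the host language. The Perl and Java syntaxes cannot be
  selected from a Ruby regexp literal, so only the first case is sketched;
  the constant names are illustrative.

```ruby
DOT_DEFAULT   = /a.b/       # "." stops at a newline
DOT_MULTILINE = /(?m)a.b/   # (?m) lets "." match a newline
LINE_ANCHORS  = /^b$/       # ^ and $ work per line without any option

p(DOT_DEFAULT   =~ "a\nb")   # => nil
p(DOT_MULTILINE =~ "a\nb")   # => 0
p(LINE_ANCHORS  =~ "a\nb")   # => 2
```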


14. Differences from the Japanized GNU regex (version 0.12) of Ruby

   + Look-behind is added.
     (?<=fixed-char-length-pattern), (?<!fixed-char-length-pattern)
     (In a negative look-behind, a capture group is not allowed;
      a shy group (?:) is allowed.)
   + Possessive quantifiers are added: ?+, *+, ++
   + Operations in character classes are added: [], &&
   + Named groups and subexp calls are added.
   + An octal or hexadecimal number sequence can be treated as
     a multibyte code char in a char-class if a multibyte encoding is specified.
     (ex. [\xa1\xa2], [\xa1\xa7-\xa4\xa1])
   + The effect of an isolated option extends to the next ')'.
     ex. (?:(?i)a|b) is interpreted as (?:(?i:a|b)), not (?:(?i:a)|b).
   + An isolated option is not transparent to the preceding pattern.
     ex. a(?i)* is a syntax error.
   + An incomplete left brace is allowed as an ordinary char.
     ex. /{/, /({)/, /a{2,3/ etc...
   + The negated POSIX bracket [:^xxxx:] is supported.
   + The POSIX bracket [:ascii:] is added.
   + Repeating a look-ahead is not allowed.
     ex. /(?=a)*/, /(?!b){5}/

15. Problems

   + An invalid first byte in UTF-8 is accepted.
     (This is the same as the GNU regex of Ruby.)

       /./u =~ "\xa3"

     Validation is possible, of course, but it would make matching
     slower than it is now.

   + A zero-length match inside an unbounded repeat stops the repeat;
     the state of the capture groups is not checked as a stop condition.

       /()*\1/ =~ ""            #=> match
       /(?:()|())*\1\2/ =~ ""   #=> fail

       /(?:\1a|())*/ =~ "a"     #=> match with ""

   + The ignore-case option has no effect on a char given as an octal or
     hexadecimal number, but it does take effect when that char appears in
     a char class. This is inconsistent, but it is the same behaviour as
     the GNU regex of Ruby.

       /\x61/i.match("A")     # => nil
       /[\x61]/i.match("A")   # => match

// END