regex

Links

Chomsky hierarchy
- Regular languages (Type 3) are a subset of
- Context-free languages (Type 2) are a subset of
- Context-sensitive languages (Type 1) are a subset of
- Recursively enumerable languages (Type 0).
FYI, if one is looking for a fairly full featured regex implementation for playing around, the .NET library's System.Text.RegularExpressions has quite a bit (more that PCRE).
RE2
Tools
Regular Expression Matching with a Trigram Index
Ragel: State Machine Compiler
The true power of regular expressions
- On matching context free language {a^n b^n, n>0} via the regex /^(a(?1)?b)$/.
  - Python re module
    - You can't express that via the built in re module.
    - Note that backreferences (like \n) match the actual matched text and not the expression represented by the group.
    - In fact, if you tried to re.compile(r'^(a\1*b)$'), you will get the error sre_constants.error: cannot refer to open group.)
  - Python regex library.
    - This one allows backreferences to expressions (i.e. recursive regex), so you can express the regex as regex.compile(r'(?V1)^(a(?1)*b)$') that works. (so regex.compile(r'(?V1)^(a(?1)*b)$').match('aaabbb') passes.)
Atomic Grouping
- Syntax (?>choices).
- It specifies a group that won't backtrack once it has been matched. For example, the regex (?>ab|a)b will match abb but not ab because once the atomic group as captured ab, there's no backtracking to try if some later part of the regex failed.
Possessive Quantifiers
- Syntax: X*+.
- It's a shorthand notation for Atomic Grouping.
- X*+ → is the same as (?>X*). If X was a group, remember to make sure that is is included as a group inside the Atomic Expression when you do the transform. (i.e. (?:a|b)*+ → (?>(?:a|b)*)).
Literal Characters and Special Characters
- \Q and \E is supported by PCRE and some others but NOT by Python even when using the regex library.
Unicode
- Canonical equivalence and matching
  - If you have Unicode literal characters in your pattern and have canonical equivalence turned on, then they will match any equivalent code point. For example, if your pattern included the character ö, then it will match that character whether it appears in the text as a single precomposed character (e.g. \u00F6 → ö) OR appears in the text as a two code points—\u006F (o) followed by the combining character for the umlaut \u0308. However, if instead you wrote it in terms of unicode code points (using the \uXXXX or \u{XXXX} syntaxes supported by your regex engine), it will NOT use Unicode equivalence and match only the specific code points that you called out.
  - Note also that the . character can match individual Unicode codepoints. For instance, if the text to be matched contains ö written as two code points (o followed by the umlaut combining character), the . is free to match JUST the o and not consume the combining character that follows. (Is this really true in Unicode mode or only when you're using quantifiers with the .?)
- Unicode Categories
Recursion and Subroutine Calls May or May Not Be Atomic
Flags / Modifiers
Reference
- Python Regex
  - New regex module.
  - re module
- Javascript Regexp