regex
Links
- Chomsky hierarchy
- Regular languages (Type 3) are a subset of
- Context-free languages (Type 2) are a subset of
- Context-sensitive languages (Type 1) are a subset of
- Recursively enumerable languages (Type 0).
- FYI, if one is looking for a fairly full featured regex
implementation for playing around, the .NET library's
System.Text.RegularExpressions has quite a bit (more
that PCRE).
- RE2
- Tools
- Regular Expression Matching with a Trigram Index
- Ragel: State Machine Compiler
- The true power of regular expressions
- On matching context free language
{a^n b^n, n>0} via the regex
/^(a(?1)?b)$/.
- Python
re module
- You can't express that via the built in
re module.
- Note that backreferences (like
\n) match the actual matched text and not the expression represented by the group.
- In fact, if you tried to
re.compile(r'^(a\1*b)$'), you will get the
error sre_constants.error: cannot refer to
open group.)
- Python regex library.
- This one allows backreferences to
expressions (i.e. recursive regex), so you
can express the regex as
regex.compile(r'(?V1)^(a(?1)*b)$') that works. (so
regex.compile(r'(?V1)^(a(?1)*b)$').match('aaabbb') passes.)
- Atomic Grouping
- Syntax
(?>choices).
- It specifies a group that won't backtrack once it
has been matched. For example, the regex
(?>ab|a)b will match abb but not ab because
once the atomic group as captured ab, there's no
backtracking to try if some later part of the regex
failed.
- Possessive Quantifiers
- Syntax:
X*+.
- It's a shorthand notation for Atomic Grouping.
X*+ → is the same as (?>X*). If X was a
group, remember to make sure that is is included as
a group inside the Atomic Expression when you do the
transform. (i.e. (?:a|b)*+ → (?>(?:a|b)*)).
- Literal Characters and Special Characters
\Q and \E is supported by PCRE and some others
but NOT by Python even when using the regex
library.
- Unicode
- Canonical equivalence and matching
- If you have Unicode literal characters in your
pattern and have canonical equivalence turned
on, then they will match any equivalent code
point. For example, if your pattern included
the character
ö, then it will match that
character whether it appears in the text as a
single precomposed character (e.g. \u00F6 →
ö) OR appears in the text as a two code
points—\u006F (o) followed by the combining
character for the umlaut \u0308. However,
if instead you wrote it in terms of unicode code
points (using the \uXXXX or \u{XXXX}
syntaxes supported by your regex engine), it
will NOT use Unicode equivalence and match only
the specific code points that you called out.
- Note also that the
. character can match
individual Unicode codepoints. For instance, if
the text to be matched contains ö written as
two code points (o followed by the umlaut
combining character), the . is free to match
JUST the o and not consume the combining
character that follows. (Is this really true in
Unicode mode or only when you're using
quantifiers with the .?)
- Unicode Categories
- Recursion and Subroutine Calls May or May Not Be Atomic
- Flags / Modifiers
- Reference