Regular Expressions 101

Community Patterns

Community Library Entry

1

Regular Expression
Python

r"
^(?:(?:(?=a+((?(1)\1)b))a)*\1)?$
"
gm

Description

Matches some number of a's followed by the same number of b's, usually written as a^nb^n for all natural numbers n. For regular expressions as formally defined in formal language theory, this is impossible, but modern day regex engines can do far more than match just regular languages. Note that even though this works with the Python flavor on regex101, this doesn't actually compile with Python's re library, since their regex engine doesn't allow you to:

  1. Lazily evaluate conditional groups: (?(1)\1) requires \1 to exist, even though if it doesn't exist the pattern would just match nothing and keep parsing.
  2. Reference open groups: This is kind of related to the previous one, since we need to reference the previous value of a group within itself, and we need to use a conditional for before the group first matches anything. As per 1, this requires the group to exist even if we don't use the value, hence it doesn't work.

Here is a short explanation on how the regex works: The key part of the regex is this lookahead (?=a+((?(1)\1)b)) wrapped by ((?=...)a)*. The lookahead skips through the a's, and then group one ((?(1)\1)b)) matches either b if group one doesn't exist (the first "iteration", where we iterate via the outer *), or group one and a b if it does exist. The effect of this is that we iterate the * group until either we run out of b's or we run out of a's. After that, we just need to check that we have group one followed by the end of the string at the end, since those are the b's. If we have too many a's, we would be matching against a...b...b, and if we have too many b's we wouldn't hit the end of the string. One small caveat is that if we have an empty string, group one doesn't exist, which we can remedy by either putting (?(1)\1) or just wrapping everything in an optional group.

Submitted by FieryIceStickie - a month ago