Regular Expressions 101

Community Library Entry

Regular Expression
PCRE2 (PHP >=7.3)

/
^(?P<indentation>(?P<sp_or_tab>[ \t])*+)(?P<stmt>(?P<non_str_lit>(?:(?!\#)[^\"'\r\n])*?)(?:(?P<str_lit>(?P<begin_quote>(?P<single_quote>')|\")(?:(?(single_quote)\"|')|\\[\"']|\\[^\r\n ]|[^\"'\\\r\n])*?(?P=begin_quote))(?&non_str_lit))*?)(?P<whitespace>(?&sp_or_tab)*+)(?P<comment>(?:\#)[^\r\n]*+)?(?P<line_ending>\r\n|\r|\n)?$
/
gm

Description

This regex parses a line of C-like source code into the groups <indentation>, <stmt>, <whitespace>, <comment>, and <line_ending>. Match it against one line of source code at a time, with or without its line ending.

Use Cases

If you run each line of a source file through this regex, you could do things like:

  1. Strip out all single-line comments. You can do this by replacing each match with "\g<indentation>\g<stmt>\g<line_ending>", which also drops the whitespace that preceded the comment.
  2. Refactor the line ending. You can do this by replacing each match with "\g<indentation>\g<stmt>\g<whitespace>\g<comment>{new_line_ending}" where {new_line_ending} is replaced with the desired line ending.
  3. Compute the indentation level of each line. Since the group <indentation> contains the line's indentation, you can divide the length of its contents by number_of_spaces_per_indent. This might be useful when writing an editor macro that changes the indentation of a block of code.
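
The three use cases above can be sketched in Python. Python's re module has no (?&name) subroutine calls, so the pattern below is a hand-ported approximation rather than the regex from this entry: the subroutines are inlined, the possessive quantifiers are replaced with greedy ones, each quote type gets its own alternative, and the line-ending alternation tries \r\n first. The names LINE, strip_comment, set_line_ending, and indent_level are hypothetical.

```python
import re

# Hand-ported approximation of the PCRE2 pattern for Python's `re` module.
# Assumptions: subroutine calls are inlined, quantifiers are plain greedy,
# and single- and double-quoted literals are separate alternatives.
NON_STR = r"""(?:(?!\#)[^"'\r\n])*?"""
STR_LIT = r"""(?:'(?:\\[^\r\n ]|[^'\\\r\n])*'|"(?:\\[^\r\n ]|[^"\\\r\n])*")"""
LINE = re.compile(
    r"^(?P<indentation>[ \t]*)"
    r"(?P<stmt>" + NON_STR + r"(?:" + STR_LIT + NON_STR + r")*?)"
    r"(?P<whitespace>[ \t]*)"
    r"(?P<comment>\#[^\r\n]*)?"
    r"(?P<line_ending>\r\n|\r|\n)?$"
)


def _groups(line):
    """Return the named groups as strings, or None if the line does not match."""
    m = LINE.match(line)
    if m is None:
        return None
    # Optional groups that did not participate become empty strings.
    return {name: value or "" for name, value in m.groupdict().items()}


def strip_comment(line):
    """Use case 1: drop the comment and the whitespace in front of it."""
    g = _groups(line)
    if g is None:
        return line  # leave non-matching lines untouched
    return g["indentation"] + g["stmt"] + g["line_ending"]


def set_line_ending(line, new_line_ending):
    """Use case 2: keep everything except the line ending, which is replaced."""
    g = _groups(line)
    if g is None:
        return line
    return (g["indentation"] + g["stmt"] + g["whitespace"]
            + g["comment"] + new_line_ending)


def indent_level(line, number_of_spaces_per_indent=4):
    """Use case 3: indentation length divided by the indent width."""
    g = _groups(line)
    if g is None:
        return None
    return len(g["indentation"]) // number_of_spaces_per_indent
```

For instance, strip_comment('x = "a # b"  # real comment\n') keeps the # inside the string literal and returns 'x = "a # b"\n'.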

Groups

  1. <indentation> contains the indentation of the line of code. Can be an empty string if the line is not indented.
  2. <stmt> contains the statement/expression that the line of code contains, excluding the indentation, the whitespace following the statement, and any single-line comment. Can be an empty string if the line contains no statement (such as a blank line or a line that is all comment).
  3. <whitespace> contains the whitespace that follows the statement, up to the beginning of the comment or the end of the line. Can be an empty string if nothing separates them, for example when the single-line comment immediately follows the statement.
  4. <comment> contains the single-line comment of the line, including the character sequence that denotes the start of a comment. This group can be missing if the line has no comment.
  5. <line_ending> contains the line ending of the line. This group can be missing if the line has no line ending.

Comment Sequence

The character sequence that the regex treats as the start of a single-line comment is #, which suits Python code. To target a different programming language, replace every occurrence of \# in the regex with that language's comment sequence. Make sure to escape the sequence by preceding each character that has a special meaning in regex with a backslash.
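
As a rough illustration of swapping the comment sequence, the sketch below builds the pattern from a parameter and lets re.escape add the backslashes. It is a Python re approximation of the regex (subroutine calls inlined, greedy instead of possessive quantifiers, one alternative per quote type, \r\n tried first), not the PCRE2 original, and build_line_pattern and cpp_line are hypothetical names.

```python
import re


def build_line_pattern(comment_seq="#"):
    """Compile an approximation of the line pattern for a given comment sequence."""
    # re.escape puts a backslash before characters that are special in regex.
    c = re.escape(comment_seq)
    non_str = rf"""(?:(?!{c})[^"'\r\n])*?"""
    str_lit = r"""(?:'(?:\\[^\r\n ]|[^'\\\r\n])*'|"(?:\\[^\r\n ]|[^"\\\r\n])*")"""
    return re.compile(
        r"^(?P<indentation>[ \t]*)"
        rf"(?P<stmt>{non_str}(?:{str_lit}{non_str})*?)"
        r"(?P<whitespace>[ \t]*)"
        rf"(?P<comment>(?:{c})[^\r\n]*)?"
        r"(?P<line_ending>\r\n|\r|\n)?$"
    )


# A pattern for languages whose single-line comments start with //.
cpp_line = build_line_pattern("//")
m = cpp_line.match('url = "http://x";  // fetch later\n')
print(m.group("stmt"))     # the // inside the string literal stays in <stmt>
print(m.group("comment"))  # only the trailing comment is captured here
```

Building the pattern from a parameter keeps both places where the sequence appears (the lookahead in the statement and the start of the comment group) in sync.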

Parsing Rules

  • A line can contain any number of string literals, including none.
  • A line that contains a malformed string literal does not match the regex.
  • Since the regex understands string literals, the character sequence which marks the start of a single-line comment only has an effect outside string literals.
  • Escape sequences are understood. Quotes of the other type than the one delimiting the string literal may be left unescaped (so "the 'quick' \\brown\\ \"fox\" jumps over the lazy\ncat\rdog\n" would be understood as a single valid string literal, while "foo bar \ baz" wouldn't match the regex because its backslash is followed by a space instead of a character to escape).
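
The string-literal rules above can be checked in isolation with a small Python sketch. It covers only the string-literal part of the regex and writes it as one alternative per quote type, which is an assumption made for readability (the regex in this entry uses a conditional group instead). STR_LIT and is_string_literal are hypothetical names.

```python
import re

# String-literal part only. Inside a literal, a backslash must be followed by
# a character other than CR, LF, or space; the other quote type may appear bare.
STR_LIT = re.compile(
    r"""(?:'(?:\\[^\r\n ]|[^'\\\r\n])*'|"(?:\\[^\r\n ]|[^"\\\r\n])*")"""
)


def is_string_literal(text):
    """Return True if `text` is exactly one well-formed string literal."""
    return STR_LIT.fullmatch(text) is not None


# The two examples from the parsing rules, written as raw Python strings so
# the backslashes reach the regex unchanged.
valid = r'''"the 'quick' \\brown\\ \"fox\" jumps over the lazy\ncat\rdog\n"'''
invalid = r'"foo bar \ baz"'

print(is_string_literal(valid))
print(is_string_literal(invalid))
```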

Multi-line Comments

This regex does not understand multi-line comments. The first and last lines of a Python docstring (for example, """start of docstring) would not match the regex at all, since such a line is seen as an empty string literal followed by an unclosed string literal. The remaining lines of the docstring would be parsed with the same rules as regular source code.

As I'm a regex beginner, this is probably horribly un-optimized, but at least it works.

Submitted by A-Paint-Brush - 3 months ago (Last modified 2 months ago)