Regular Expressions 101

Community Library Entry

Regular Expression
PCRE2 (PHP >=7.3)

^(?P<indentation>(?P<sp_or_tab>[ \t])*+)(?P<stmt>(?P<non_str_lit>(?:(?!\#)[^\"'\r\n])*?)(?:(?P<str_lit>(?P<begin_quote>(?P<single_quote>')|\")(?:(?(single_quote)\"|')|\\[\"']|\\[^\r\n ]|[^\"'\\\r\n])*?(?P=begin_quote))(?&non_str_lit))*?)(?P<whitespace>(?&sp_or_tab)*+)(?P<comment>(?:\#)[^\r\n]*+)?(?P<line_ending>\r\n|\r|\n)?$

Description

This regex parses a single line of C-like source code into the groups <indentation>, <stmt>, <whitespace>, <comment>, and <line_ending>. Match it against one line of source code at a time, with or without its line ending.

Use Cases

If you run each line of a source file through this regex, you could do things like:

Strip out all single-line comments by replacing each match with "\g<indentation>\g<stmt>\g<line_ending>".
Change the line ending by replacing each match with "\g<indentation>\g<stmt>\g<whitespace>\g<comment>{new_line_ending}", where {new_line_ending} stands for the desired line ending.
Compute the indentation level of each line by dividing the length of <indentation> by number_of_spaces_per_indent. This could be useful in an editor macro that changes the indentation of a block of code.
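
The three use cases above can be sketched in Python. Python's standard re module has no (?&name) subroutine calls, so LINE_RE below is a simplified re-expression of the pattern with the same group names, not the submitted regex itself; it also tries \r\n before \r so that a CRLF ending is captured whole. The helper names (strip_comment, change_line_ending, indent_level) and the default of 4 spaces per indent are assumptions made for illustration.

```python
import re

# Simplified stand-in for the PCRE2 pattern (an assumption of this sketch):
# same group names, with the subroutine calls written out inline.
_NON_STR = r"(?:(?!\#)[^\"'\r\n])*?"
_STR_LIT = (r"(?:'(?:\\[^\r\n ]|[^'\\\r\n])*'"
            r'|"(?:\\[^\r\n ]|[^"\\\r\n])*")')
LINE_RE = re.compile(
    r"^(?P<indentation>[ \t]*)"
    + "(?P<stmt>" + _NON_STR + "(?:" + _STR_LIT + _NON_STR + ")*?)"
    + r"(?P<whitespace>[ \t]*)"
    + r"(?P<comment>\#[^\r\n]*)?"
    + r"(?P<line_ending>\r\n|\r|\n)?$"
)


def strip_comment(line):
    """Drop the comment and the whitespace in front of it."""
    # Groups that did not participate expand to "" (Python 3.5+).
    # A line that does not match is returned unchanged.
    return LINE_RE.sub(r"\g<indentation>\g<stmt>\g<line_ending>", line, count=1)


def change_line_ending(line, new_ending):
    """Rebuild the line with a different line ending."""
    m = LINE_RE.match(line)
    if m is None:
        return line  # malformed string literal: leave the line untouched
    return m.expand(r"\g<indentation>\g<stmt>\g<whitespace>\g<comment>") + new_ending


def indent_level(line, spaces_per_indent=4):
    """Indentation level, assuming the line is indented with spaces."""
    m = LINE_RE.match(line)
    return None if m is None else len(m["indentation"]) // spaces_per_indent
```

Each helper expects exactly one line, so a file would be fed through them line by line, for example via str.splitlines(keepends=True).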

Groups

<indentation> contains the indentation of the line of code. Can be an empty string if the line is not indented.
<stmt> contains the statement/expression on the line, excluding the indentation, any whitespace that trails the statement, and any single-line comment. Can be an empty string if the line contains no statement (such as a blank line or a line that is all comment).
<whitespace> contains the whitespace between the statement and the beginning of the comment. Can be an empty string if the single-line comment immediately follows the statement with no whitespace.
<comment> contains the single-line comment of the line, including the character sequence that denotes the start of a comment. This group can be missing if the line has no comment.
<line_ending> contains the line ending of the line. This group can be missing if the line has no line ending.
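
The difference between an empty group and a missing group can be seen in a short Python sketch. Python's standard re module lacks (?&name) subroutine calls, so the pattern here is a simplified re-expression that keeps the five group names; it is an assumption of this example, as is the helper name parse_line. In Python a missing group is reported as None, while an empty one is reported as "".

```python
import re

# Simplified stand-in for the PCRE2 pattern (an assumption of this sketch).
_NON_STR = r"(?:(?!\#)[^\"'\r\n])*?"
_STR_LIT = (r"(?:'(?:\\[^\r\n ]|[^'\\\r\n])*'"
            r'|"(?:\\[^\r\n ]|[^"\\\r\n])*")')
LINE_RE = re.compile(
    r"^(?P<indentation>[ \t]*)"
    + "(?P<stmt>" + _NON_STR + "(?:" + _STR_LIT + _NON_STR + ")*?)"
    + r"(?P<whitespace>[ \t]*)"
    + r"(?P<comment>\#[^\r\n]*)?"
    + r"(?P<line_ending>\r\n|\r|\n)?$"
)


def parse_line(line):
    """Return the named groups as a dict, or None if the line does not match."""
    m = LINE_RE.match(line)
    return None if m is None else m.groupdict()


if __name__ == "__main__":
    for sample in ["    total = 1  # sum\n", "# only a comment", ""]:
        print(repr(sample), "->", parse_line(sample))
```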

Comment Sequence

The character sequence that the regex uses to identify the start of a single-line comment is # by default, which suits Python code. It can be changed to fit a different programming language by replacing all occurrences of \# in the regex with a different character sequence. Make sure to escape the sequence by preceding every character that has a special meaning in regex with a backslash.
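
As a sketch of swapping the comment sequence, the Python function below builds the pattern for any marker and lets re.escape handle the backslash escaping. The function name build_line_regex is hypothetical, and the pattern is a simplified re-expression for Python's standard re module (which has no (?&name) subroutine calls), not the submitted PCRE2 regex.

```python
import re


def build_line_regex(comment_seq="#"):
    """Compile a line parser for the given single-line comment marker."""
    c = re.escape(comment_seq)  # escapes characters that are special in regex
    # Text outside string literals: stops at quotes and at the comment marker.
    non_str = r"(?:(?!" + c + r")[^\"'\r\n])*?"
    str_lit = (r"(?:'(?:\\[^\r\n ]|[^'\\\r\n])*'"
               r'|"(?:\\[^\r\n ]|[^"\\\r\n])*")')
    return re.compile(
        r"^(?P<indentation>[ \t]*)"
        + "(?P<stmt>" + non_str + "(?:" + str_lit + non_str + ")*?)"
        + r"(?P<whitespace>[ \t]*)"
        + "(?P<comment>" + c + r"[^\r\n]*)?"
        + r"(?P<line_ending>\r\n|\r|\n)?$"
    )


# Example: C-style "//" comments instead of "#".
C_LINE_RE = build_line_regex("//")
```

Because the marker is checked with a lookahead, a multi-character sequence such as // only counts when all of its characters appear together, so a lone / in an expression stays part of <stmt>.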

Parsing Rules

A line can contain any number of string literals, including none.
A line does not match the regex if it contains a malformed string literal.
Since the regex understands string literals, the character sequence which marks the start of a single-line comment only has an effect outside string literals.
Escape sequences are understood. A quote of the other type than the one delimiting the string literal does not need to be escaped, so "the 'quick' \\brown\\ \"fox\" jumps over the lazy\ncat\rdog\n" is understood as a single valid string literal. By contrast, "foo bar \ baz" does not match the regex, because its backslash is followed by a space and so does not form an escape sequence.
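
These rules can be checked with a small Python sketch. The pattern is a simplified re-expression for Python's standard re module (no (?&name) subroutine calls there), so it is an assumption of this example rather than the submitted regex, and split_stmt_comment is a hypothetical helper name.

```python
import re

# Simplified stand-in for the PCRE2 pattern (an assumption of this sketch).
_NON_STR = r"(?:(?!\#)[^\"'\r\n])*?"
# A string literal: escapes are a backslash plus any character except
# CR, LF, or space; the other quote type may appear unescaped.
_STR_LIT = (r"(?:'(?:\\[^\r\n ]|[^'\\\r\n])*'"
            r'|"(?:\\[^\r\n ]|[^"\\\r\n])*")')
LINE_RE = re.compile(
    r"^(?P<indentation>[ \t]*)"
    + "(?P<stmt>" + _NON_STR + "(?:" + _STR_LIT + _NON_STR + ")*?)"
    + r"(?P<whitespace>[ \t]*)"
    + r"(?P<comment>\#[^\r\n]*)?"
    + r"(?P<line_ending>\r\n|\r|\n)?$"
)


def split_stmt_comment(line):
    """Return (stmt, comment), or None for a malformed string literal."""
    m = LINE_RE.match(line)
    return None if m is None else (m["stmt"], m["comment"])


if __name__ == "__main__":
    samples = [
        r'''s = "it's \"ok\" # here"  # real''',  # "#" inside the literal is kept
        r'"foo bar \ baz"',                       # backslash followed by a space
        'name = "unterminated',                   # missing closing quote
    ]
    for sample in samples:
        print(sample, "->", split_stmt_comment(sample))
```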

Multi-line Comments

This regex does not understand multi-line comments. The first and last lines of a multi-line Python docstring (for example, """start of docstring) would not match the regex at all, because the three quotes are seen as an empty string literal followed by an unclosed string literal. The remaining lines of the docstring would be parsed with the same rules as regular source code.
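
One way to see this limitation is to list the lines of a file that fail to match. The sketch below does so in Python with a simplified re-expression of the pattern (Python's standard re module has no (?&name) subroutine calls, so this is not the submitted regex); the helper name unmatched_lines and the sample source are made up for illustration.

```python
import re

# Simplified stand-in for the PCRE2 pattern (an assumption of this sketch).
_NON_STR = r"(?:(?!\#)[^\"'\r\n])*?"
_STR_LIT = (r"(?:'(?:\\[^\r\n ]|[^'\\\r\n])*'"
            r'|"(?:\\[^\r\n ]|[^"\\\r\n])*")')
LINE_RE = re.compile(
    r"^(?P<indentation>[ \t]*)"
    + "(?P<stmt>" + _NON_STR + "(?:" + _STR_LIT + _NON_STR + ")*?)"
    + r"(?P<whitespace>[ \t]*)"
    + r"(?P<comment>\#[^\r\n]*)?"
    + r"(?P<line_ending>\r\n|\r|\n)?$"
)


def unmatched_lines(source):
    """Return the 1-based numbers of the lines the pattern cannot parse."""
    return [
        number
        for number, line in enumerate(source.splitlines(keepends=True), start=1)
        if LINE_RE.match(line) is None
    ]


SAMPLE = (
    "def f():\n"
    '    """Start of docstring\n'
    "    middle line\n"
    '    end of docstring"""\n'
    "    return 1  # done\n"
)

if __name__ == "__main__":
    print(unmatched_lines(SAMPLE))
```

A caller that needs to handle docstrings would have to track the opening and closing lines itself, since the middle line is indistinguishable from ordinary code.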

As I'm a regex beginner, this is probably horribly un-optimized, but at least it works.

Submitted by A-Paint-Brush - 4 months ago (Last modified 4 months ago)
