This parses a line of C-like source code into the groups <indentation>, <stmt>, <whitespace>, <comment>, and <line_ending>. It should be matched against a single line of source code, with or without its line ending.
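As a rough illustration of how the five groups split a line and can be reused, here is a minimal Python sketch. It does not use the original regex: build_line_regex, rebuild, and the pattern inside them are hypothetical, simplified stand-ins that ignore string literals, so a comment marker inside a string would be treated as the start of a comment. Only the group names and the two substitution templates are taken from this description.

```python
import re


def build_line_regex(comment_start="#"):
    """Return a simplified stand-in pattern with the five named groups.

    Assumption: this is not the original regex. It ignores string literals,
    so a comment marker inside a string is treated as the start of a comment.
    """
    marker = re.escape(comment_start)  # escape regex metacharacters in the marker
    return re.compile(
        r"(?P<indentation>[ \t]*)"
        # Lazy, so trailing whitespace and the comment fall into later groups.
        r"(?P<stmt>(?:(?!" + marker + r")[^\r\n])*?)"
        r"(?P<whitespace>[ \t]*)"
        r"(?P<comment>" + marker + r"[^\r\n]*)?"
        r"(?P<line_ending>\r\n|\r|\n)?"
    )


def rebuild(line, regex, template):
    """Rewrite one line from its groups; unparsable lines are returned as-is."""
    match = regex.fullmatch(line)
    return line if match is None else match.expand(template)


python_line = build_line_regex("#")

# A line that carries its line ending: all five groups are present.
with_ending = python_line.fullmatch("    x = 1  # set x\n")
print(with_ending.groupdict())

# A line without comment or line ending: those two groups are missing (None).
without_ending = python_line.fullmatch("y = 2")
print(without_ending["comment"], without_ending["line_ending"])

# Drop the comment and the whitespace before it, keep the line ending.
stripped = rebuild("    x = 1  # set x\n", python_line,
                   r"\g<indentation>\g<stmt>\g<line_ending>")

# Force a CRLF line ending while keeping everything else.
crlf = rebuild("    x = 1  # set x\n", python_line,
               r"\g<indentation>\g<stmt>\g<whitespace>\g<comment>" + "\r\n")

# Indentation level, assuming four spaces per indent.
level = len(with_ending["indentation"]) // 4
```

The sketch uses fullmatch so that every group is anchored to the whole line, and relies on Python 3.5+ behaviour where groups that did not participate in the match expand to an empty string. Passing a different marker, for example build_line_regex("//"), gives a variant for C-style comments.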
If you run each line of a source file through this regex, you could do things like:
- Strip comments, along with the whitespace before them, by substituting each line with "\g<indentation>\g<stmt>\g<line_ending>".
- Change line endings by substituting each line with "\g<indentation>\g<stmt>\g<whitespace>\g<comment>{new_line_ending}", where {new_line_ending} is replaced with the desired line ending.
- Work out indentation levels: <indentation> contains the indentation of each line, so you could divide the length of its contents by number_of_spaces_per_indent to get the indentation level of each line. This might be useful when writing an editor macro to change the indentation of a block of code.
The groups:
- <indentation> contains the indentation of the line of code. Can be an empty string if the line is not indented.
- <stmt> contains the statement or expression on the line, excluding the indentation, any trailing whitespace after the statement, and any single-line comment. Can be an empty string if the line contains no statement (such as a blank line or a line that is all comment).
- <whitespace> contains the whitespace between the statement and the beginning of the comment. Can be an empty string if the single-line comment immediately follows the statement with no whitespace.
- <comment> contains the single-line comment of the line, including the character sequence that denotes the start of a comment. This group can be missing if the line has no comment.
- <line_ending> contains the line ending of the line. This group can be missing if the line has no line ending.
By default, the regex uses # to identify the start of a single-line comment, which suits Python code. This can be changed to fit a different programming language by replacing all occurrences of \# in the regex with a different character sequence. Make sure to escape the sequence by preceding every character that has a special meaning in regex with a backslash.
String literals are recognized together with their backslash escapes (for example, "the 'quick' \\brown\\ \"fox\" jumps over the lazy\ncat\rdog\n" would be understood as a single valid string literal, while "foo bar \ baz" wouldn't match the regex due to the unescaped backslash).
This regex does not understand multi-line comments. The first and last lines of a Python docstring (for example, """start of docstring) would not match the regex at all, as such a line would be seen as an empty string literal followed by an unclosed string literal. The remaining lines in the docstring would be parsed with the same rules as regular source code.
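The string-literal behaviour described above can be reproduced with a much smaller pattern. The sketch below is an assumption, not an excerpt from the original regex: STRING_LITERAL is a hypothetical sub-pattern for double-quoted literals that accepts a backslash only when it starts one of a few known escape sequences. It is meant to show why the two example literals and the docstring opener behave as described.

```python
import re

# Hypothetical sub-pattern, not copied from the original regex: a double-quoted
# literal in which every backslash must start one of a few known escapes.
STRING_LITERAL = re.compile(r'"(?:[^"\\\r\n]|\\[\\"nrt])*"')

valid = r'''"the 'quick' \\brown\\ \"fox\" jumps over the lazy\ncat\rdog\n"'''
invalid = r'"foo bar \ baz"'
docstring_start = '"""start of docstring'

# Every backslash in the first literal starts a known escape, so it matches whole.
is_valid = STRING_LITERAL.fullmatch(valid) is not None

# The backslash followed by a space is not a known escape, so this one fails.
is_invalid_rejected = STRING_LITERAL.fullmatch(invalid) is None

# The docstring opener reads as an empty literal ("") ...
empty = STRING_LITERAL.match(docstring_start)
# ... followed by a literal that is never closed.
rest = docstring_start[empty.end():]
rest_is_literal = STRING_LITERAL.match(rest) is not None

print(is_valid, is_invalid_rejected, empty.group(), rest_is_literal)
```

Because a quote character can only ever open or close a literal in this sketch, the unclosed remainder has nothing else to match against, which is the same reason the description gives for docstring delimiters failing to match.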
As I'm a regex beginner, this is probably horribly un-optimized, but at least it works.