Community Patterns

19

Get path from any text

Created·2023-01-31 14:38
Updated·2023-07-23 20:17
Flavor·PCRE2 (PHP)
Gets Windows-style paths from any kind of text (error message, e-mail body, ...), quoted or not.

THIS IS THE SINGLE-LINE VERSION! To understand how it works or to edit it, go to https://regex101.com/r/7o2fyy

  • Relative paths are not supported.
  • The goal is to catch what "looks like" a path. See the limitations.
  • UNC paths and prefixed paths like [//./], [//?/] or [//./UNC/] are allowed.
  • Some URL paths like [file:///C:/] or [file://] are allowed.
  • Paths quoted with ["] and ['] are caught, but the quotes are included in the match.
  • Quoted paths are not subject to the limitations.

Limitations (unquoted paths only):

  • [dot] and [space] are allowed, but not in a row.
  • [dot+space] or [space+dot] at the end of a file name is not caught.

Inside a file name (or the last directory, if the path points to a directory):

  • [comma] is not supported (it stops the match).
  • After the first [dot], any [space] stops the match.
  • After a [space], the match stops if the next character is not a [letter], [digit] or [-]; so a double [space] stops the match.

Compatibility:

  • Compatible with PCRE and PCRE2.
  • AutoHotkey: don't forget to escape "%" as "`%".
  • /!\ PowerShell and .NET /!\: this regex needs some modification to be interpreted by PowerShell. You have to replace each (?&CapturGroupName) with \k<CapturGroupName>. Use this PowerShell code to do the replacement:

    $powershellRegex = @'
    [Put here the regex to replace (?&CapturGroupName) with \k<CapturGroupName>]
    '@ -replace '\(\?&(\w+)\)', '\k<$1>'

This example code must return: [Put here the regex to replace \k<CapturGroupName> with \k<CapturGroupName>]
Submitted by nitrateag
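
For readers who do not have PowerShell at hand, the textual replacement described in the PowerShell/.NET note of the entry above can be sketched in Python. This is only an illustration, assuming the target syntax is .NET's `\k<Name>` named reference. The function name `convert_subroutine_calls` and the sample pattern are hypothetical and are not part of the original entry.

```python
import re

# Matches a PCRE-style named call such as (?&drive) and captures the name.
SUBROUTINE_CALL = re.compile(r"\(\?&(\w+)\)")

def convert_subroutine_calls(pattern: str) -> str:
    """Rewrite every (?&Name) in `pattern` as \\k<Name> (purely textual)."""
    # A function is used as the replacement so that the backslash is emitted
    # literally, without relying on replacement-template escaping rules.
    return SUBROUTINE_CALL.sub(lambda m: "\\k<" + m.group(1) + ">", pattern)

# Hypothetical sample pattern, used only to show the rewrite.
sample_pattern = r"(?<drive>[A-Z]:)(?&drive)"
print(convert_subroutine_calls(sample_pattern))
```

The substitution only rewrites the text of the pattern. It does not check that the converted pattern behaves the same way in the target engine, so the result should be tested against real input.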

Community Library Entry

1

Regular Expression
Created·2024-12-05 02:56
Updated·2024-12-05 03:24
Flavor·.NET 7.0 (C#)

@"
((\r\n|\r|\n){2,}|\p{Zp}+|\A)^\s*\S
"
gmnN

Description

Finds the start of every paragraph in the input text, where a paragraph start is defined as the first non-whitespace character that follows any of the following, along with any other whitespace that precedes it:

  • 2 or more consecutive CRLF sequences
  • 2 or more consecutive CR characters
  • 2 or more consecutive LF characters
  • 1 or more Unicode Paragraph Separator class characters
  • The beginning of the string (matches the first paragraph)

Again, note that whitespace mixed in with the above will not interfere with the matching, as demonstrated by the test text included.

This is intended to be used with the options specified, so be sure to include them for best performance (non-backtracking, multiline, non-capturing, invariant culture).

This will work on any version of .NET that supports the included syntax. However, it is intended for use with .NET 8.0 and up, with the Regex.EnumerateMatches() method, or, ideally, with .NET 9.0 and up, using the new Regex.EnumerateSplits() method, to avoid the allocations associated with Match objects.

Unicode paragraph separator characters are very rare in practice, and support for them is almost non-existent in software, including the Windows Console, Windows Terminal, web browsers, the Windows clipboard, Notepad, Visual Studio, and Notepad++. Each fails to handle the character in its own way, and none of them actually adds a line where it occurs (though Notepad++ will show it as PS if you have enabled showing all whitespace).

It is safe to remove |\p{Zp}+ from the pattern if you do not wish to include those characters in your search. The resulting pattern, as a C# string, would be:

"((\\r\\n|\\r|\\n){2,}|\\A)^\\s*\\S"
Submitted by dodexahedron
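
To show how the simplified pattern (without the |\p{Zp}+ branch) locates paragraph starts, here is a minimal sketch in Python rather than C#. It is a port made under assumptions, not the entry's own code. Python's `re` module has no `\p{...}` classes and no non-backtracking option, so only the multiline flag is reproduced. The groups are written as non-capturing, and the sample text uses only LF and CRLF line endings. The names `PARAGRAPH_START`, `paragraph_starts` and `split_paragraphs` are hypothetical.

```python
import re

# Port of the simplified pattern: two or more line breaks (or the start of
# the string), then optional whitespace, then one non-whitespace character.
PARAGRAPH_START = re.compile(r"(?:(?:\r\n|\r|\n){2,}|\A)^\s*\S", re.MULTILINE)

def paragraph_starts(text: str) -> list[int]:
    """Return the index of the first non-whitespace character of each paragraph."""
    # Every match ends just past the paragraph's first character.
    return [m.end() - 1 for m in PARAGRAPH_START.finditer(text)]

def split_paragraphs(text: str) -> list[str]:
    """Slice the text between consecutive paragraph starts."""
    bounds = paragraph_starts(text) + [len(text)]
    # Trailing line breaks and spaces belong to the separator, so strip them.
    return [text[a:b].rstrip() for a, b in zip(bounds, bounds[1:])]

sample = "First para.\n\nSecond para.\r\n\r\n  Third para."
print(paragraph_starts(sample))
print(split_paragraphs(sample))
```

Because each match ends on the first character of a paragraph, `m.end() - 1` gives the paragraph's starting index. A single line break inside a paragraph produces no match, so the paragraph is not split there.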