Community Patterns


Paragraph Delimiter Counter (Unicode-Aware)

Created·2024-12-05 02:56
Updated·2024-12-05 03:24
Flavor·.NET 7.0 (C#)
Finds all paragraphs in the input text, where a paragraph is defined as any occurrence of a non-whitespace character immediately following any of the following, plus any other preceding whitespace:

2 or more consecutive CRLF sequences
2 or more consecutive CR characters
2 or more consecutive LF characters
1 or more Unicode Paragraph Separator class characters
The beginning of the string (matches the first paragraph)

Note that whitespace mixed in with the above will not interfere with the matching, as demonstrated by the test text included.

This is intended to be used with the options specified, so be sure to include them for best performance (non-backtracking, multiline, non-capturing, invariant culture).

This will work on any version of .NET that supports the included syntax. However, it is intended for use with .NET 8.0 and up, with the Regex.EnumerateMatches() method, or, ideally, with .NET 9.0 and up, using the new Regex.EnumerateSplits() method, to avoid the allocations associated with Match objects.

Unicode paragraph separator characters are very rare in practice, and support for them is almost non-existent in software, including the Windows Console, Windows Terminal, web browsers, the Windows clipboard, Notepad, Visual Studio, and Notepad++. All of these fail to handle the character in their own ways, and none of them actually adds a line when it occurs (though Notepad++ will show it as PS if you have enabled showing all whitespace).

It is safe to remove |\p{Zp}+ from the pattern if you do not wish to include those characters in your search. The resulting pattern, as a C# string, would be:

"((\\r\\n|\\r|\\n){2,}|\\A)^\\s*\\S"
Submitted by dodexahedron

Community Library Entry


Regular Expression
Created·2014-06-26 09:59
Updated·2023-07-20 15:08
Flavor·Python

r"
^
# get the title of this movie or series
(?P<title>
  [-\w'\"]+
  # match separator to later replace into correct title
  (?P<separator> [ .] )
  # note this *must* be lazy for the engine to work ltr not rtl
  (?: [-\w'\"]+\2 )*?
)
# start of movie vs serie check
(?:
  # if this is an episode, lets match the season
  # number one way or another. if not, the year
  # of the movie
  (?:
    # series. can be a lot prettier if we used perl regex...
    # make sure this is not just a number in the title followed by our separator.
    # like, iron man 3 2013 or my.fictional.24.series
    (?! \d+ \2 )
    # now try to match the season number
    (?: s (?: eason \2? )? )?
    (?P<season> \d\d? )
    # needed to validate the last token is a dot, or whatever.
    (?: e\d\d? (?:-e?\d\d?)? | x\d\d? )?
  |
    # this is likely a movie, match the year
    (?P<year> [(\]]?\d{4}[)\]]? )
  )
  # make sure this ends with the separator, otherwise we
  # might be in the middle of something like "1080p"
  (?=\2)
|
  # if we get here, this is likely still a movie.
  # match until one of the keywords
  (?=
    BOXSET | XVID | DIVX | LIMITED | UNRATED | PROPER | DTS | AC3 | AAC |
    BLU[ -]?RAY | HD(?:TV|DVD) | (?:DVD|B[DR]|WEB)RIP | \d+p | [hx]\.?264
  )
)
"
gimx

Description

A neat regex for finding out whether a given torrent name is a series or a movie.

Returns the full name of the series together with the separator, so you can prettify the title (i.e., replace the separator with a space or whatever you prefer). Also returns the season number for a series or the year for a movie, depending on which one was matched.

Submitted by Firas Dib
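
As a rough illustration of how the named groups described above might be consumed, here is a minimal Python sketch. The names RELEASE_NAME_RE and parse_release_name are hypothetical and not part of the entry, and the pattern is transcribed from the entry with its comments trimmed. The i, m and x flags are assumed to map to re.IGNORECASE, re.MULTILINE and re.VERBOSE; the g flag has no Python equivalent, so a single search() call is used instead.

```python
import re

# Pattern transcribed from the entry above (comments trimmed).
# Group 2 is the separator, so \2 refers back to it.
RELEASE_NAME_RE = re.compile(
    r"""
    ^
    (?P<title>
      [-\w'\"]+
      (?P<separator> [ .] )
      (?: [-\w'\"]+\2 )*?
    )
    (?:
      (?:
        (?! \d+ \2 )
        (?: s (?: eason \2? )? )?
        (?P<season> \d\d? )
        (?: e\d\d? (?:-e?\d\d?)? | x\d\d? )?
      |
        (?P<year> [(\]]?\d{4}[)\]]? )
      )
      (?=\2)
    |
      (?= BOXSET | XVID | DIVX | LIMITED | UNRATED | PROPER | DTS | AC3 | AAC
        | BLU[ -]?RAY | HD(?:TV|DVD) | (?:DVD|B[DR]|WEB)RIP | \d+p | [hx]\.?264 )
    )
    """,
    re.IGNORECASE | re.MULTILINE | re.VERBOSE,
)


def parse_release_name(name):
    """Hypothetical helper: classify a release name and tidy its title."""
    match = RELEASE_NAME_RE.search(name)
    if match is None:
        return None
    # The title group ends with a trailing separator, so replace every
    # separator with a space and trim the result.
    title = match.group("title").replace(match.group("separator"), " ").strip()
    season = match.group("season")
    year = match.group("year")
    return {
        # Assumption for this sketch: a captured season means a series,
        # anything else is treated as a movie.
        "kind": "series" if season is not None else "movie",
        "title": title,
        "season": int(season) if season is not None else None,
        # The year group may include a bracket character, so strip those.
        "year": int(year.strip("()[]")) if year is not None else None,
    }


if __name__ == "__main__":
    for sample in (
        "The.Walking.Dead.S05E03.720p.HDTV.x264",
        "Iron.Man.3.2013.1080p.BluRay.x264",
    ):
        print(parse_release_name(sample))
```

Returning None for names without a separator, and treating every non-series match as a movie, are design choices of this sketch rather than behaviour stated in the entry.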