Regular Expressions 101

Capture Paragraph Text in HTML

Regular Expression (Python flavor, flags: gms)

r"(?#This tag is the start of the paragraph)<p>(?# This lookahead makes sure paragraph isn't an empty line. Needs it.)(?![\n|<])(?#This group captures the paragraph.)(.{0,500}?)(?#This negative lookahead/custom character class prevents <tags>)(?![^<][\s]|[\w]+[^>])(?#This tag is the end of the paragraph)</p>(?# This lookahead makes sure the paragraph ends at the end of the line.)(?=\n|<div|<p>)"

Description

TLDR: It can be used to capture all the paragraphs from some webnovel sites. Update: after testing it on other sites, I've found it lacking.

This is intended to match a <p> tag and its closing tag. The goal is to match and capture only paragraphs that have no other HTML tags nested inside them.

The {0,500} quantifier caps how many characters a paragraph may contain; there is room to adjust it.
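
As a rough illustration of adjusting the quantifier, the sketch below builds the pattern with a configurable upper bound. The helper name `paragraph_pattern` and its `max_len` parameter are hypothetical and not part of the original entry. The s and m flags are mapped to `re.DOTALL` and `re.MULTILINE`; the g flag has no Python equivalent, so `findall` plays that role here.

```python
import re

def paragraph_pattern(max_len=500):
    """Build the condensed pattern with a configurable length cap (hypothetical helper)."""
    return re.compile(
        r"<p>(?![\n|<])(.{0," + str(max_len) + r"}?)"
        r"(?![^<][\s]|[\w]+[^>])</p>(?=\n|<div|<p>)",
        re.DOTALL | re.MULTILINE,  # the s and m flags from the entry
    )

# A paragraph of 599 characters, longer than the default cap of 500.
body = ("lorem " * 100).strip()
html = "<p>" + body + "</p>\n"

short_cap = paragraph_pattern(500).findall(html)   # cap too small for this paragraph
long_cap = paragraph_pattern(2000).findall(html)   # cap large enough

print(len(short_cap))  # 0
print(len(long_cap))   # 1
```

With the default cap the long paragraph is skipped entirely rather than truncated, so a larger bound may be worth trying on sites with long paragraphs.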

Condensed regex:

<p>(?![\n|<])(.{0,500}?)(?![^<][\s]|[\w]+[^>])</p>(?=\n|<div|<p>)
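
A minimal sketch of using the condensed pattern from Python, assuming the trailing lookahead from the commented version (`\n`, `<div`, or `<p>`). The sample markup and the names `PARAGRAPH` and `sample_html` are made up for illustration.

```python
import re

# Condensed pattern; s and m flags become re.DOTALL and re.MULTILINE.
PARAGRAPH = re.compile(
    r"<p>(?![\n|<])(.{0,500}?)(?![^<][\s]|[\w]+[^>])</p>(?=\n|<div|<p>)",
    re.DOTALL | re.MULTILINE,
)

# Hypothetical input resembling a simple chapter page.
sample_html = (
    '<div class="chapter">\n'
    "<p>First paragraph.</p>\n"
    "<p></p>\n"
    "<p>Second paragraph.</p>\n"
    "</div>\n"
)

# findall returns the contents of the single capture group for each match,
# which stands in for the g (global) flag.
paragraphs = PARAGRAPH.findall(sample_html)
for text in paragraphs:
    print(text)
```

On this sample the empty `<p></p>` is skipped by the first lookahead, and the two text paragraphs are returned. Note that the final lookahead requires a newline, `<div`, or `<p>` right after `</p>`, so a paragraph at the very end of the input with nothing following it would not match.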
Submitted by anonymous - 4 months ago (Last modified 4 months ago)