Regular Expressions 101

Community Patterns

HTML Document or Fragment Heuristic

0

Regular Expression
Python

r"
<!DOCTYPE html>|</?\s*[a-z-][^>]*\s*>|(\&(?:[\w\d]+|#\d+|#x[a-f\d]+);)
"
g

Description

From the StackOverflow question "Check if a string is html or not" there are many examples and a few approaches, some more expensive than others. Full DOM parsing and detecting constructed DOM elements is sure-fire, but slow, requiring a browser with parser and DOM representation. Regular expressions are often frowned upon, but they may get a bad rap in solving this problem. This isn't trying to parse HTML, only identify if it is likely present. A heuristic, not a definitive answer.

Made more complicated by virtue of the fact that browsers will accept almost anything and go out of their way to attempt to parse it HTML.

Submitted by Alice Bevan-McGregor - 3 years ago