Regular Expressions 101

A 'Bulletproof' URL Regular Expression Parser

Regular Expression
PCRE (PHP <7.3)

/
(?<URL>(?<SCHEME>(?:https?))(?<HOSTNAME>\:\/\/(?:www.|[a-zA-Z.]+)[a-zA-Z0-9\-\.]+\.(?:biz|ca|com|edu|gov|info|me|mil|museum|name|net|org|uk|ru|us))?(?<PORT>\:[0-9][0-9]{0,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?(?<PATH>[a-zA-Z0-9\-\.\/]+)?(?<QUERY>(?:\?$|[a-zA-Z0-9\.\,\;\?\'\\\+&%\$\=~_\-\*]+))?(?<FRAGMENT>#[a-zA-Z0-9\-\.]+)?)
/
gm

Description

A well-documented, easy-to-modify, bulletproof regular expression that matches complete URLs or individual URL components. The components are:

Schema

  • scheme/protocol: use an OR statement to add FTP, etc.
  • hostname: use OR statements to add more TLDs
  • port: matches a leading zero; to disallow it, change the first block in the subexpression from [0-9] to [1-9]
  • path
  • query: to capture subcomponents of the query string, delete the target symbol in the ?<QUERY> subexpression
  • fragment
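
The scheme and hostname bullets above come down to extending two alternations. A minimal sketch in Python, assuming a hypothetical helper `build_url_pattern` that is not part of the original submission. Python's `re` module spells named groups `(?P<NAME>...)` rather than PCRE's `(?<NAME>...)`, so the pattern is ported accordingly, and the outermost group is named URL as the Substitution section assumes:

```python
import re

# Defaults copied from the pattern above.
DEFAULT_SCHEMES = ("https?",)
DEFAULT_TLDS = ("biz", "ca", "com", "edu", "gov", "info", "me", "mil",
                "museum", "name", "net", "org", "uk", "ru", "us")


def build_url_pattern(schemes=DEFAULT_SCHEMES, tlds=DEFAULT_TLDS):
    """Hypothetical helper: rebuild the pattern with custom alternations."""
    scheme_alt = "|".join(schemes)  # e.g. "https?|ftp"
    tld_alt = "|".join(tlds)        # e.g. "biz|ca|...|us|io"
    return re.compile(
        r"(?P<URL>(?P<SCHEME>(?:" + scheme_alt + r"))"
        r"(?P<HOSTNAME>\:\/\/(?:www.|[a-zA-Z.]+)[a-zA-Z0-9\-\.]+\.(?:" + tld_alt + r"))?"
        r"(?P<PORT>\:[0-9][0-9]{0,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}"
        r"|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?"
        r"(?P<PATH>[a-zA-Z0-9\-\.\/]+)?"
        r"(?P<QUERY>(?:\?$|[a-zA-Z0-9\.\,\;\?\'\\\+&%\$\=~_\-\*]+))?"
        r"(?P<FRAGMENT>#[a-zA-Z0-9\-\.]+)?)",
        re.MULTILINE,  # corresponds to the "m" flag; "g" is covered by search/finditer/sub
    )


default_pattern = build_url_pattern()
extended_pattern = build_url_pattern(
    schemes=DEFAULT_SCHEMES + ("ftp",),  # add FTP to the scheme alternation
    tlds=DEFAULT_TLDS + ("io",),         # add one more TLD
)

ftp_match = extended_pattern.search("ftp://files.example.org/pub/readme.txt")
io_match = extended_pattern.search("https://example.io/docs")
print(ftp_match.group("SCHEME"), ftp_match.group("HOSTNAME"), ftp_match.group("PATH"))
print(io_match.group("HOSTNAME"))
```

Note that the HOSTNAME group includes the leading `://`, exactly as in the original pattern, and that the default pattern finds nothing in the FTP example because the scheme is mandatory.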

☛ The name of the captured subexpression is usually case-sensitive.

Substitution

Complete URL

\1, or its equivalent, can be used to backreference or substitute the complete URL. The named forms ${URL} and $+{URL} assume the outermost group is named URL.

  • \1
  • $1
  • ${URL}
  • $+{URL}
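
As an illustration of substituting the complete URL, the sketch below wraps every URL in a text in angle brackets. It is a Python port, so the syntax differs from the list above: named groups are written `(?P<NAME>...)`, and the replacement template uses `\1` or `\g<URL>` instead of `$1` or `${URL}`. The sample text and variable names are assumptions made for the example.

```python
import re

# Python port of the pattern; the outermost group is named URL (group 1).
URL_PATTERN = re.compile(
    r"(?P<URL>(?P<SCHEME>(?:https?))"
    r"(?P<HOSTNAME>\:\/\/(?:www.|[a-zA-Z.]+)[a-zA-Z0-9\-\.]+\."
    r"(?:biz|ca|com|edu|gov|info|me|mil|museum|name|net|org|uk|ru|us))?"
    r"(?P<PORT>\:[0-9][0-9]{0,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}"
    r"|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?"
    r"(?P<PATH>[a-zA-Z0-9\-\.\/]+)?"
    r"(?P<QUERY>(?:\?$|[a-zA-Z0-9\.\,\;\?\'\\\+&%\$\=~_\-\*]+))?"
    r"(?P<FRAGMENT>#[a-zA-Z0-9\-\.]+)?)",
    re.MULTILINE,
)

text = "Visit http://example.com/a and https://example.org/b today"

# Numbered and named references to the same outermost group.
by_number = URL_PATTERN.sub(r"<\1>", text)
by_name = URL_PATTERN.sub(r"<\g<URL>>", text)

print(by_number)
```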

URL Component Map

${SCHEME}   → \2
${HOSTNAME} → \3
${PORT}     → \4
${PATH}     → \5
${QUERY}    → \6
${FRAGMENT} → \7
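
The name-to-number mapping above can be checked programmatically. A small sketch using a Python port of the pattern (named groups written as `(?P<NAME>...)`, outermost group named URL); `groupindex` is the standard attribute of compiled Python patterns, while the sample URL is an assumption made for the example:

```python
import re

URL_PATTERN = re.compile(
    r"(?P<URL>(?P<SCHEME>(?:https?))"
    r"(?P<HOSTNAME>\:\/\/(?:www.|[a-zA-Z.]+)[a-zA-Z0-9\-\.]+\."
    r"(?:biz|ca|com|edu|gov|info|me|mil|museum|name|net|org|uk|ru|us))?"
    r"(?P<PORT>\:[0-9][0-9]{0,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}"
    r"|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?"
    r"(?P<PATH>[a-zA-Z0-9\-\.\/]+)?"
    r"(?P<QUERY>(?:\?$|[a-zA-Z0-9\.\,\;\?\'\\\+&%\$\=~_\-\*]+))?"
    r"(?P<FRAGMENT>#[a-zA-Z0-9\-\.]+)?)",
    re.MULTILINE,
)

# Maps each group name to its number, e.g. SCHEME -> 2.
component_map = dict(URL_PATTERN.groupindex)

match = URL_PATTERN.search("https://www.example.com:8080/path/page.html?id=1#top")
for name, number in sorted(component_map.items(), key=lambda item: item[1]):
    # A name and its number always refer to the same captured text.
    print(f"{name:<9} \\{number}  {match.group(number)}")
```

Non-capturing groups such as `(?:https?)` do not consume a number, which is why the six components occupy groups 2 through 7 without gaps.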

Plug & Play substitution/backreference alternatives

  • \2\3\4\5\6\7
  • $2$3$4$5$6$7
  • ${2}${3}${4}${5}${6}${7}
  • ${SCHEME}${HOSTNAME}${PORT}${PATH}${QUERY}${FRAGMENT}
  • $+{SCHEME}$+{HOSTNAME}$+{PORT}$+{PATH}$+{QUERY}$+{FRAGMENT}
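
To show what these alternatives do in practice, the sketch below reassembles a URL from its components and then drops some of them. It is a Python port, so the templates are written `\2\3...` and `\g<NAME>` rather than `$2$3...` and `${NAME}`; the sample text and variable names are assumptions made for the example.

```python
import re

URL_PATTERN = re.compile(
    r"(?P<URL>(?P<SCHEME>(?:https?))"
    r"(?P<HOSTNAME>\:\/\/(?:www.|[a-zA-Z.]+)[a-zA-Z0-9\-\.]+\."
    r"(?:biz|ca|com|edu|gov|info|me|mil|museum|name|net|org|uk|ru|us))?"
    r"(?P<PORT>\:[0-9][0-9]{0,4}|[1-5][0-9]{4}|6[0-4][0-9]{3}"
    r"|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])?"
    r"(?P<PATH>[a-zA-Z0-9\-\.\/]+)?"
    r"(?P<QUERY>(?:\?$|[a-zA-Z0-9\.\,\;\?\'\\\+&%\$\=~_\-\*]+))?"
    r"(?P<FRAGMENT>#[a-zA-Z0-9\-\.]+)?)",
    re.MULTILINE,
)

text = "See https://www.example.com:8080/path/page.html?id=1#top for details"

# Concatenating all six components reproduces the matched URL unchanged.
numbered = URL_PATTERN.sub(r"\2\3\4\5\6\7", text)
named = URL_PATTERN.sub(
    r"\g<SCHEME>\g<HOSTNAME>\g<PORT>\g<PATH>\g<QUERY>\g<FRAGMENT>", text
)

# Leaving components out of the template removes them from the output.
without_port_query_fragment = URL_PATTERN.sub(
    r"\g<SCHEME>\g<HOSTNAME>\g<PATH>", text
)

print(without_port_query_fragment)
```

In Python 3.5 and later, a component that did not take part in the match is substituted as an empty string, so the same templates also work for URLs that lack a port, query, or fragment.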
Submitted by anonymous - 3 years ago