Regular Expressions 101

Full HTML recognition for Udacity Project

Regular Expression
PCRE2 (PHP >=7.3)

/https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}(\?s=|\/)[a-zA-Z0-9()-.?_=&;#\/]{1,256}/gm

Description

Project 6 of Udacity's AI for Trading course, Analyzing Stock Sentiment from Twits (sentiment analysis), needs to parse the HTML links (URLs) in the text. The examples come from the actual course exercise.

Future improvements

  • Identify each /texttext/ pattern, capture it as a group, and parse it if necessary. This may have some value in the future, but I am not sure about it.

  • Identify whether the text was the result of a search, using the pattern ?s=SSS? (SSS stands for Some Stock Symbol). Knowing whether the user was looking for a specific stock could help improve training later.

  • Identify the key=9* pattern, which shows that the information was retrieved by a user logged in to a site rather than from a general search. Perhaps such messages should be given more weight.
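As a rough illustration of the second idea above, the sketch below captures whatever follows ?s= in a named group. The name PATT_SEARCH, the helper searched_symbol, and the characters allowed in a symbol are all hypothetical and are not part of the submitted pattern.

```python
import re

# Hypothetical refinement: capture the text right after "?s=" as a named group.
# The symbol's allowed characters (letters, digits, dot, hyphen) and its
# maximum length of 10 are assumptions made for this sketch only.
PATT_SEARCH = re.compile(r"https?://[^\s?]+\?s=(?P<symbol>[a-z0-9.\-]{1,10})")


def searched_symbol(text):
    """Return the searched symbol from the first matching URL, or None."""
    match = PATT_SEARCH.search(text.lower())
    return match.group("symbol") if match else None
```

A message whose URL has no ?s= part returns None, so the result could serve as a simple "came from a search" flag during training.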

Note: since I call text.lower() BEFORE applying this pattern, the uppercase ranges (A-Z) can be dropped from the character classes, which may improve performance. So you can use something like:

patt_url = r"https?:\/\/(www\.)?[-a-z0-9@:%._\+~#=]{1,256}\.[a-z0-9()]{1,6}(\?s=|\/)[a-z0-9()-.?_=&;#\/]{1,256}"
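A minimal sketch of how the lowercase pattern might be applied in Python. The names PATT_URL and strip_urls are illustrative, and replacing each match with a space is an assumption about the preprocessing step, not something the course exercise prescribes.

```python
import re

# Lowercase-only variant of the pattern, written as a raw string so the
# backslashes reach the regex engine unchanged.
PATT_URL = re.compile(
    r"https?:\/\/(www\.)?[-a-z0-9@:%._\+~#=]{1,256}\.[a-z0-9()]{1,6}"
    r"(\?s=|\/)[a-z0-9()-.?_=&;#\/]{1,256}"
)


def strip_urls(text):
    """Lowercase the text, then replace every URL match with a single space."""
    return PATT_URL.sub(" ", text.lower())


cleaned = strip_urls("Buy $AAPL now http://www.example.com/news/apple-stock today")
print(cleaned.split())
```

Note that the pattern requires at least one character after the domain's / or ?s=, so a bare address such as http://example.com is left untouched.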

Submitted by Eduardo Passeto - 8 months ago (Last modified 8 months ago)