Community Library Entry

Regular Expression
Created·2024-09-19 04:55
Flavor·JavaScript

(?:[^\p{Script=Latin}\s\w;:.,\-[\](){}'"+\/=<>])\B|(?:(?:[\p{Script=Latin}]|[^\s\w;:.,\-[\](){}'"+\/=<>]){2,}(?:(?:[\p{Script=Latin}]|[^\s\w+.,–:;\/\\=<>])?(?:[\p{Script=Latin}]|[^\s\w;\-:.,\[\](){}'"+\/\\=<>]))*)|(?:(?:[0-9\p{Script=Latin}]|[^\s\w;:.,[\](){}\-'"+\/\\=<>])(?:(?:[0-9\p{Script=Latin}]|[^\s\w+\-:;\/\\=<>])?(?:[0-9\p{Script=Latin}]|[^\s\w;:.,\-[\](){}'"+\/\\=<>]))*)

gmui

Description

Word Patterns - multilanguage

Separates words while handling the many use cases I needed, beyond what a simple `/\bword\b/gi` covers.
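
To illustrate why a plain word-boundary approach falls short on multilanguage text, here is a minimal sketch. It assumes the "simple" approach means matching runs of `\w` between `\b` boundaries; `NAIVE_WORDS` and `naiveWords` are hypothetical names used only for this illustration. Because `\w` is ASCII-only in JavaScript, accented Latin letters break words apart and non-Latin scripts yield nothing.

```javascript
// Naive tokenizer in the spirit of /\bword\b/gi: runs of \w between word boundaries.
// NAIVE_WORDS and naiveWords are hypothetical names for this sketch.
const NAIVE_WORDS = /\b\w+\b/gi;

function naiveWords(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return text.match(NAIVE_WORDS) ?? [];
}

// "bão số 3" written with escapes so the precomposed letters are unambiguous.
console.log(naiveWords("b\u00E3o s\u1ED1 3")); // ["b", "o", "s", "3"]
console.log(naiveWords("ありません"));          // []
```

The accented letters `ã` and `ố` are not `\w`, so each Vietnamese word is cut at the diacritic, and the Japanese text produces no tokens at all.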

Below are some targeted test cases and acceptable failed cases:

Bão Yagi (được Việt Nam định danh là bão số 3, được phía Philippines đặt tên bão Enteng - tiếng Anh: Severe Tropical Storm Enteng , nguyên văn 'Bão nhiệt đới dữ dội Enteng')
Some legal "[d]ocuments" contain corrected spel[l]ing, gram(m)ar, or simple typos; and lots of references[1]. By Extension, I included curl{e}y brackets, but not tag brack<e>ts, which are not seenin modern legal documents. +These.are.properly.separated.even.U.S.A., and multiple punctuations are properly ignored.
Basic patterns
- A. Multiple
- B. Choices
- -ABC-DEF-
- -A-B-C-D-
- -1-2-3-4-
- -test-
- .ABC.DEF.
- .AB.CD.EF.D.
- .A.B.C.D.E.
- .123.456.789.
- .12.34.56.78.
- .1.2.3.4.
- cod3 var1aBl3s
- test.U.S.A.test
We want hyphenated words kept whole, for cases where long words are broken for wrapping in tight-column newspapers/magazines, while numbers such as "30-35 pages" are still properly separated.
Non-Latin characters are separated per character:
- 出典: フリー百科事典『ウィキペディア（Wikipedia）』
- ウィキペディアには現在この名前の項目はありません。
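
A few of the cases listed above can be checked with a short script. This is only a sketch: the pattern is copied verbatim from this entry with its `gmui` flags, while `WORD_PATTERN` and `splitWords` are hypothetical names introduced for the illustration.

```javascript
// The pattern from this entry, copied verbatim, with flags g, m, u, i.
const WORD_PATTERN = /(?:[^\p{Script=Latin}\s\w;:.,\-[\](){}'"+\/=<>])\B|(?:(?:[\p{Script=Latin}]|[^\s\w;:.,\-[\](){}'"+\/=<>]){2,}(?:(?:[\p{Script=Latin}]|[^\s\w+.,–:;\/\\=<>])?(?:[\p{Script=Latin}]|[^\s\w;\-:.,\[\](){}'"+\/\\=<>]))*)|(?:(?:[0-9\p{Script=Latin}]|[^\s\w;:.,[\](){}\-'"+\/\\=<>])(?:(?:[0-9\p{Script=Latin}]|[^\s\w+\-:;\/\\=<>])?(?:[0-9\p{Script=Latin}]|[^\s\w;:.,\-[\](){}'"+\/\\=<>]))*)/gmui;

// Hypothetical helper: return every token the pattern matches in the text.
function splitWords(text) {
  // match() returns null when nothing matches, so fall back to an empty array.
  return text.match(WORD_PATTERN) ?? [];
}

console.log(splitWords("30-35 pages")); // ["30", "35", "pages"]
console.log(splitWords("well-known"));  // ["well-known"]
console.log(splitWords("U.S.A."));      // ["U.S.A"]
console.log(splitWords("ありません"));   // ["あ", "り", "ま", "せ", "ん"]
```

The numeric range is split at the hyphen, the hyphenated word stays whole, the dotted abbreviation keeps its inner dots but drops the trailing one, and the Japanese text comes back one character per token.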

Acceptable failed cases:

[E]xpected: "[" is considered an external "wrapper/enclosure", while the internal "wrappers" are included so that they can be further processed/removed in the future.
test.U.S.A.test: this happens when no space trails the textContent of a block-level element in HTML files.
『ウィキペディア（Wikipedia）』: mixed languages
CamelCasing is not separated, but it can easily be split in a post-processing step; snake_casing works fine by happenstance.
cod3 var1aBl3s: letters and numbers are mixed. This is not intentional, but it is not a big deal, since the only side effect is that a new word always starts with a number.
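
The post-processing step for CamelCasing is not specified in this entry. One possible sketch, assuming a split at every lowercase-to-uppercase transition is acceptable, is shown below; `splitCamelCase` is a hypothetical helper, not part of the pattern.

```javascript
// Hypothetical post-processing helper: split a token wherever a lowercase
// letter is immediately followed by an uppercase letter.
function splitCamelCase(token) {
  // Zero-width split point: lowercase letter behind, uppercase letter ahead.
  return token.split(/(?<=\p{Ll})(?=\p{Lu})/u);
}

console.log(splitCamelCase("CamelCasing")); // ["Camel", "Casing"]
console.log(splitCamelCase("words"));       // ["words"]
```

Using `\p{Ll}` and `\p{Lu}` rather than `[a-z]` and `[A-Z]` keeps the split working for accented Latin letters. Runs of capitals such as acronyms are left together under this assumption.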

Submitted by Bryan Ho
