I wanted to make a regex that just works. This is it, for ECMAScript engines.
It matches all 5024 emoji listed in emoji-test.txt on the official Unicode website as of 6/14/2024 (thank you not, Apple Intelligence).
This regex also rejects glyphs that must be part of a grapheme cluster but appear on their own (more on this in the "Implementation" section).
-- These and similar fail
1 2 3 4 5 6 7 8 9 # * ‼ ↔
-- These succeed
Basic: 😀
Basic + Modifier: 🦸🏾
Basic + ZWJ + Basic: 🐦🔥
Basic + Modifier + ZWJ + Basic: ❤️‍🔥
Basic + ZWJ + Basic + Modifier: 🐻❄️
Basic + Modifier + ZWJ + Basic + Modifier + ZWJ + Basic + ZWJ + Basic + Modifier: 👩🏼❤️💋👩🏿
Here ZWJ means "Zero-Width Joiner," the Unicode character U+200D,
which joins two separate emojis into one, e.g.:
😮💨 = 😮 (U+1F62E) + (U+200D) + 💨 (U+1F4A8)
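A minimal sketch of that composition in JavaScript, building the string from the three code points above. The variable names are only for illustration.

```javascript
// Compose "face exhaling" from its parts: U+1F62E + U+200D (ZWJ) + U+1F4A8.
const ZWJ = "\u200D";
const face = String.fromCodePoint(0x1F62E); // 😮
const dash = String.fromCodePoint(0x1F4A8); // 💨
const exhaling = face + ZWJ + dash;

// Spreading a string iterates by code point, so this lists the three parts.
const codePoints = [...exhaling].map(
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase()
);
console.log(codePoints.join(" ")); // U+1F62E U+200D U+1F4A8
console.log(exhaling.length);      // 5 UTF-16 code units (2 + 1 + 2)
```

Whether the result is drawn as a single glyph depends on the font and renderer; the string itself is always these three code points.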
In order to make this expression robust against new emojis being created, I used the inherent Unicode structure of emojis to validate the string.
Emojis have the following structure:
-- BEGIN
\p{Emoji} -- Class of basic, single-character emoji
-- BEGIN Optional Section
-- Case 1: One or more non-ZWJ modifiers (skin, hair, simple-grapheme modifier, etc.)
-- < negative look ahead for ZWJ >
\p{Emoji_Component}+
-- Case 2: ZWJ followed by basic emoji
-- < check for ZWJ >
\p{Emoji} -- We've composed a new emoji!
-- END Optional Section
-- * Repeat the optional section as many times as possible to get the longest chain of emojis joined by ZWJs
-- END
*
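The outline above can be sketched in JavaScript roughly as follows. This is an illustration assembled from the outline, not necessarily the exact expression: the names BASIC, MODIFIERS, JOINED, and wholeEmoji are made up here, and the sketch leaves out the check for solitary glyphs. One detail is an assumption of the sketch: the ZWJ lookahead is applied to every component, because U+200D is itself a member of \p{Emoji_Component}.

```javascript
// Sketch only: assembled from the outline, with hypothetical names.
const BASIC = String.raw`\p{Emoji}`;
// Case 1: one or more components, none of which is the ZWJ (U+200D).
const MODIFIERS = String.raw`(?:(?!\u200D)\p{Emoji_Component})+`;
// Case 2: a ZWJ followed by another basic emoji.
const JOINED = String.raw`\u200D\p{Emoji}`;

// Basic emoji, then the optional section repeated as often as possible.
// The "u" flag is what enables \p{...} property escapes.
const SOURCE = `${BASIC}(?:${MODIFIERS}|${JOINED})*`;
const wholeEmoji = new RegExp(`^${SOURCE}$`, "u");

const cp = (...points) => String.fromCodePoint(...points);
const samples = [
  cp(0x1F600),                         // Basic
  cp(0x1F9B8, 0x1F3FE),                // Basic + Modifier
  cp(0x1F426, 0x200D, 0x1F525),        // Basic + ZWJ + Basic
  cp(0x2764, 0xFE0F, 0x200D, 0x1F525), // Basic + Modifier + ZWJ + Basic
  cp(0x1F43B, 0x200D, 0x2744, 0xFE0F), // Basic + ZWJ + Basic + Modifier
];
for (const s of samples) console.log(s, wholeEmoji.test(s)); // true for each
```

The samples are built from code points rather than pasted as glyphs, so the invisible ZWJ and U+FE0F characters cannot get lost in copy and paste.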
The class \p{Emoji} also contains characters that are not generally considered emojis, like ©, ❄, or ✔. These glyphs may even be used to compose new emojis, as in the case of
🏋♂ = 🏋 (U+1F3CB) + (U+200D) + ♂ (U+2642)
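A quick way to see how broad \p{Emoji} is, as a small sketch (isEmoji is a hypothetical helper name):

```javascript
// \p{Emoji} needs the "u" flag; without it, \p is not a property escape.
const isEmoji = (ch) => /^\p{Emoji}$/u.test(ch);

// Digits, "#", "*", ©, ❄, ✔ and ♂ are all in the class; "a" is not.
const glyphs = ["1", "#", "*", "\u00A9", "\u2744", "\u2714", "\u2642", "a"];
for (const g of glyphs) console.log(JSON.stringify(g), isEmoji(g));
// Everything except "a" reports true.
```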
When these glyphs are not part of a larger grapheme cluster, this regex rejects them. That's what the first negative lookahead checks: if the match starts with one of these glyphs, the next code point must be the variation selector (U+FE0F) that they require.
This variation is what turns ✔ into ✔️.
Also of note, there are some glyphs in this range that do act as conventional emojis, like ✅ (U+2705). These can also be written as ✅ (U+2705 U+200D), with a ZWJ added at the end. If you keep adding ZWJs, the rendering doesn't change, but you will have more characters to backspace through (at least on my MacBook).
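The extra ZWJs are invisible but still present in the string, which a short sketch makes visible (names are illustrative only):

```javascript
const check = "\u2705";                    // ✅
const padded = check + "\u200D".repeat(3); // ✅ followed by three ZWJs

console.log(check.length);  // 1
console.log(padded.length); // 4
// The extra code points are all U+200D.
console.log([...padded].slice(1).every((ch) => ch === "\u200D")); // true
```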
The lookahead only matters when the glyph is at the beginning of the match; otherwise it will be preceded by a ZWJ.
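One way such a check on the first code point could be written is sketched below. It assumes that the glyphs needing U+FE0F are exactly those without the Emoji_Presentation property, which is an assumption of this sketch rather than something stated above; guardedStart is a hypothetical name.

```javascript
// Reject a first code point that lacks default emoji presentation
// unless U+FE0F follows it. (Assumption: Emoji_Presentation is the
// right test for "needs the variation selector".)
const guardedStart = /^(?!\P{Emoji_Presentation}(?!\uFE0F))\p{Emoji}/u;

console.log(guardedStart.test("\u2714"));       // false: bare ✔
console.log(guardedStart.test("\u2714\uFE0F")); // true:  ✔ + U+FE0F
console.log(guardedStart.test("\u2705"));       // true:  ✅ has emoji presentation
console.log(guardedStart.test("7"));            // false: bare digit
```

The outer lookahead reads as "the next code point is not one that lacks emoji presentation and also lacks a following U+FE0F". A real expression may need a looser rule for other kinds of sequences.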
So long as emojis are represented in the format specified above, this regex will be robust against new emojis being created because it uses character classes instead of fixed code point ranges.
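To illustrate the difference, here is a small sketch comparing a fixed range with a property class. The fixed range (the Emoticons block) is a hypothetical example chosen for this comparison, and the result for \p{Emoji} depends on the Unicode data shipped with the engine.

```javascript
// Hypothetical fixed range: the Emoticons block, U+1F600..U+1F64F.
const fixedRange = /^[\u{1F600}-\u{1F64F}]$/u;
const byProperty = /^\p{Emoji}$/u;

const grinning = "\u{1F600}";   // 😀, inside the block
const heartsFace = "\u{1F970}"; // 🥰, added later, outside the block

console.log(fixedRange.test(grinning), byProperty.test(grinning));     // true true
console.log(fixedRange.test(heartsFace), byProperty.test(heartsFace)); // false true
```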