clock menu more-arrow no yes

Filed under:

Optimizing RegEx for Emoji

Emoji support was requested for Vox Media's live blogging platforms. In considering an approach for this feature, there were two goals: 1) use GitHub's gemoji gem as an established source of emoji data and graphics, and 2) parse these data ubiquitously into live blogging texts as HTML images. For #2, I opened Rubular and started banging out regular expressions, which soon produced a complete parsing tool. While this article will focus on adventures in regex optimization, let's first answer a tangentially related question of human interest...

What ARE Emoji?

Our beloved smiley faces :smile: and piles of poop :poop: are really just Unicode text characters with special font graphics mapped to them. Glossing over the finer points of Unicode, let's imagine that we tell a device: "Hey device, whenever you encounter the letter 'X' in a block of text, render a smiley face for it". That's the general idea.

Unfortunately, different devices handle this mapping of text characters into emoji icons in different ways. On some devices, emojis are identified by a pattern of two or more text characters in sequence, rather than by a single character. While "X" may produce a smiley face on our imaginary device, another device may require the pattern "XY". Due to this inconsistent mapping between devices, many websites simply replace emoji text characters with standard HTML image tags to assure proper display.

A Basic Emoji RegEx

If emoji characters are defined by text patterns, then we just need a regular expression that matches all 845 unique emojis. Piece of cake :shortcake:, right? To start, let's stage another imaginary example with emojis mapped to basic human-readable characters:

  • :apple: is A, OR the pattern AX
  • :bee: is B, OR the pattern BX
  • :cat: is C, OR the pattern CX

To match Apple, Bee, Cat, and the other 842 emojis, this simple regex pattern should get the job done:

/(AX|A|BX|B|CX|C| ... 842 other pairs ... )/g

The only trick here is that the multi-character pattern for each emoji needs to come before its single-character alternative, otherwise the shorter pattern would eagerly return partial matches.

Okay, that pattern finds all 845 emojis. However, it works by matching two patterns per emoji... that's 1690 patterns, and half of those require matching multiple characters! Running my parser's test suite using this regex clocks in around 800 milliseconds. Let's try to lower that.

Optimization, first pass

For starters, let's consolidate each emoji matcher down to a single pattern, and just make the delta character optional for each:

/(AX?|BX?|CX?| ... 842 others ... )/g

Nice. We've halved the number of pattern alternatives from 1690 down to 845, and each pattern now only mandates a single hard character match before moving on. But wait: all patterns appear to have the same optional character, X?, so really we can condense this down further:

/((?:A|B|C| ... 842 others ... )X?)/g

Running the parser's test suite now clocks in around 400ms, about 200% faster than before. Woot.

Optimization, second pass

Of course, there's always an outlier. In the case of emojis, it's those regional flags (:us:, :de:, :jp:)... Each flag uses a 2-4 character pattern:

  • :apple: is A, OR the pattern AX
  • :bee: is B, OR the pattern BX
  • :cat: is C, OR the pattern CX
  • :uk: is the pattern FF, OR the pattern FXFX

Darn. This breaks the nice rhythm of our first optimization pass, and drops that Union Jack into our regex with a new paired pattern: /((?:A|B|C|FXF|FF| ... 841 others ... )X?)/g

While this is uglier and probably less efficient, there are only a handful of regional flags within the set of 845 emoji. Such a small margin is probably negligible, right? Unfortunately, edge cases still add overhead to all executions of the regex. To speed this up, let's condense the Union Jack down to a single pattern by making its delta character optional, and then place this longer edge case at the end of the regex where it will be reached less often: /((?:A|B|C| ... 841 others ... |FX?F)X?)/g

Running the parser's test suite now clocks in around 120ms. Nice.

You may notice that patterns FFX and FXF would also be matched. This makes a reasonable assumption about the uniqueness of unicode sequences.

Optimization, crazy pass :tada:

The optimizations up to this point have focused on computational complexity, which makes our regex run faster for a computer. However, emoji is a human-centric domain if ever there was one: emotional iconography. So instead of just optimizing for a computer, it seems we should also optimize for our domain of human preference.

IF ONLY THE DATA EXISTED.

Oh, my... it does. Apparently we have hard numbers on the popularity of emoji symbols (ah, the Internet).

  • :apple: is A, OR the pattern AX, with a popularity of 1,500
  • :bee: is B, OR the pattern BX, with a popularity of 500
  • :cat: is C, OR the pattern CX, with a popularity of 100,000
  • :uk: is FF, OR the pattern FXFX, with a popularity of 5,000

With this new level of insight, the most statistically optimal regex should probably order emoji patterns by popularity, putting those most likely to be encountered first:

/((?:C|FX?F|A|B| ... 841 others ... )X?)/g

Running the parser's test suite now clocks in around 150ms, which is slower than before now that we've tailored to a domain outside of our test data. That's okay though. This is why automated tests can't always tell the full story of performance.

What about Emoticons?

Emoticons are everyone's favorite ascii art faces, :-) and they are surprisingly efficient to parse due to whitespace rules. Because an emoticon pattern must either start a line of text, or end it, or be surrounded by spaces, we can simply check whitespace using lookarounds before ever matching actual patterns:

/((?<=^|\s)(?: ... patterns ... )(?=\s|$))/g

Once we do get to pattern matching, we discover a huge amount of duplication among standard emoticons thanks to nosed :-) and noseless :) variations. Unless... the noses were optional:

/:-?\)/

While optional noses are good fun, the parser's test suite shows negligible performance gains from this optimization. Whitespace rules must already be doing good work.

Wrap

The gemoji-parser project is open source on GitHub, and includes all generated regex outputs discussed here. Big thanks to GitHub for gemoji, Matthew Rothenberg for emojitracker (respect, dude), and to Vox Media's own Michael Lovitt for the outstanding Rubular regex testing tool. Thanks also to you for reading. :vox_media: