Replacing a Complex Regular Expression with a Simple Parser

When we have to work with text, we often reach for regular expressions. But they're not the easiest solution to every problem. Recently I was struggling with a large regexp, when I realized it'd be less work to write a parser. This article explains the process.

Confession time: I don't particularly like working with regular expressions. While I use them all the time, anything more complex than /^foo.*$/ requires me to stop and think. I'm sure there are people who can decipher expressions like \A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3}) at a glance, but it takes me several minutes of googling and makes me unhappy. It's quite a difference from reading Ruby.

If you're curious, the example above is taken from this article on regex lookaheads.

The Situation

At Honeybadger I'm currently working on improving our search UI. Like many search systems, ours uses a simple query language. Before my changes, if you wanted to search for a custom date range, you had to manually type in a query like so:

occurred:[2017-06-12T16:10:00Z TO 2017-06-12T17:10:00Z]

Ouch!

In the new search UI, we want to detect when you start typing a date-related query and pop up a helpful datepicker. And of course, the datepicker is just the beginning. Eventually we'll expand the context-sensitive hinting to cover more kinds of search terms. Here are a few examples:

assigned:jane@email.com context.user.id=100
resolved:false ignored:false occurred:[
params.article.title:"Starr's parser post"       foo:'ba

I need to tokenize these strings in such a way that:

  • Whitespace separates tokens, except when surrounded by '', "" or []
  • Unquoted whitespace is its own token
  • I can run tokens.join("") to exactly recreate the input string

For example:

tokenize(%[params.article.title:"Starr's parser post"       foo:'ba])
=> ["params.article.title:\"Starr's parser post\"", "       ", "foo:'ba"]

Using A Regular Expression

My first thought was to use a capturing regular expression to define what a valid token should look like, then use String#split to split the string into tokens. It's a pretty cool trick, actually:

# The parens in the regexp mean that the separator is added to the array
"foo  bar  baz".split(/(foo|bar|baz)/)
=> ["", "foo", "  ", "bar", "  ", "baz"]
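To see what the capture group contributes, it may help to compare the same kind of split with and without it. This is just a quick sketch in plain Ruby; the sample string and variable names are only for illustration.

```ruby
text = "foo  bar  baz"

# Without a capture group, the matched separators are discarded.
plain = text.split(/\s+/)
p plain                 # => ["foo", "bar", "baz"]

# With a capture group, each match is kept as its own element,
# so joining the pieces reproduces the original string.
pieces = text.split(/(\s+)/)
p pieces                # => ["foo", "  ", "bar", "  ", "baz"]
p pieces.join == text   # => true

# A match at the very start of the string leaves an empty string in front of it.
leading = text.split(/(foo)/)
p leading               # => ["", "foo", "  bar  baz"]
```

The last line shows where the empty string comes from: split always returns whatever precedes the first match, even when that is nothing at all.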

This looked promising initially, despite the weird empty string at the start. But my real-world regular expression was much more complex. My first draft looked like this:

/
  (                          # Capture group is so split will include matching and non-matching strings
    (?:                      # The first character of the key, which is
      (?!\s)[^:\s"'\[]{1}    # ..any valid "key" char not preceded by whitespace
      |^[^:\s"'\[]{0,1}      # ..or any valid "key" char at beginning of line
    )
    [^:\s"'\[]*              # The rest of the "key" chars
    :                        # a colon
    (?:                      # The "value" chars, which are
      '[^']+'                # ..anything surrounded by single quotes
      | "[^"]+"              # ..or anything surrounded by double quotes
      | \[\S+\sTO\s\S+\]     # ..or anything like [x TO y]
      | [^\s"'\[]+           # ..or any string not containing whitespace or special chars
    )
  )
/xi 

Working with this gave me a sinking feeling. Every time I found an edge case I'd have to amend the regular expression, making it even more complex. In addition, it needed to work in JavaScript as well as Ruby, so certain features like negative lookbehind weren't available.

...It was about this time that the absurdity of all this struck me. The regular expression approach I was using was much more complicated than it would be to write a simple parser from scratch.

Anatomy of a Parser

I'm no expert, but simple parsers are simple. All they do is:

  • Step through a string, character by character
  • Append each character to a buffer
  • When a token-separating condition is encountered, save the buffer to an array and empty it.

Knowing this, we can set up a simple parser that splits strings by whitespace. It's roughly equivalent to "foo bar".split(/(\s+)/).

class Parser

  WHITESPACE = /\s/
  NON_WHITESPACE = /\S/

  def initialize
    @buffer = []
    @output = []
  end

  def parse(text) 
    text.each_char do |c|
      case c
      when WHITESPACE
        flush if previous.match(NON_WHITESPACE)
        @buffer << c
      else
        flush if previous.match(WHITESPACE)
        @buffer << c
      end
    end

    flush
    @output
  end

  protected

  def flush
    if @buffer.any?
      @output << @buffer.join("")
      @buffer = []
    end
  end

  def previous
    @buffer.last || ""
  end

end


puts Parser.new().parse("foo bar baz").inspect

# Outputs ["foo", " ", "bar", " ", "baz"]

This is a step in the direction of what I want, but it's missing support for quotes and brackets. Fortunately, adding that only takes a few lines of code:

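  def parse(text)
    # The opening delimiter we are currently inside of, or nil
    surround = nil

    text.each_char do |c|
      case c
      when WHITESPACE
        flush if previous.match(NON_WHITESPACE) && !surround
        @buffer << c
      when '"', "'"
        # Emit pending unquoted whitespace as its own token
        flush if previous.match(WHITESPACE) && !surround
        @buffer << c
        if !surround
          surround = c
        elsif surround == c
          flush
          surround = nil
        end
      when "["
        flush if previous.match(WHITESPACE) && !surround
        @buffer << c
        surround = c if !surround
      when "]"
        flush if previous.match(WHITESPACE) && !surround
        @buffer << c
        if surround == "["
          flush
          surround = nil
        end
      else
        flush if previous.match(WHITESPACE) && !surround
        @buffer << c
      end
    end

    flush
    @output
  end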

This code is only a bit longer than my regular-expression-based approach but is much more straightforward.
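As a sanity check, the sample queries from earlier can be run through a compact, self-contained restatement of this approach. The sketch below is only an illustration: the tokenize name is borrowed from the earlier example, while the closers table and the lambda-based flush are assumptions made to keep it short, not the code running in production.

```ruby
# Hypothetical helper: a compact restatement of the quote-aware parser.
def tokenize(text)
  closers = { '"' => '"', "'" => "'", "[" => "]" } # opening => closing delimiter
  tokens = []
  buffer = +""
  surround = nil # the opening delimiter we are currently inside of, if any

  # Move the buffered characters (if any) into the token list.
  flush = lambda do
    tokens << buffer unless buffer.empty?
    buffer = +""
  end

  text.each_char do |c|
    if surround
      # Inside quotes or brackets, everything belongs to the current token.
      buffer << c
      if c == closers[surround]
        flush.call
        surround = nil
      end
    else
      # Outside delimiters, a switch between whitespace and
      # non-whitespace ends the current token.
      prev = buffer[-1]
      flush.call if prev && (prev.match?(/\s/) != c.match?(/\s/))
      buffer << c
      surround = c if closers.key?(c)
    end
  end

  flush.call
  tokens
end

queries = [
  "assigned:jane@email.com context.user.id=100",
  "resolved:false ignored:false occurred:[",
  %(params.article.title:"Starr's parser post"       foo:'ba),
  "occurred:[2017-06-12T16:10:00Z TO 2017-06-12T17:10:00Z]"
]

queries.each do |query|
  tokens = tokenize(query)
  p tokens
  # Requirement: joining the tokens recreates the input exactly
  raise "round trip failed for #{query.inspect}" unless tokens.join == query
end
```

Unterminated input such as occurred:[ or foo:'ba simply ends up as the final token, which is what a hinting UI needs while someone is still typing.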

Parting thoughts

There's probably a regular expression out there that would work fine for my use case. If history is any guide, it's probably simple enough to make me look like a fool. :)

But I really enjoyed the chance to write this little parser. It broke me out of the rut I was in with the regex approach. As a nice bonus, I'm a lot more confident in the resulting code than I ever am with code that is based around complicated regular expressions.

    Starr Horne

    Starr Horne is a Rubyist and Chief JavaScripter at Honeybadger.io. When she's not neck-deep in other people's bugs, she enjoys making furniture with traditional hand-tools, reading history and brewing beer in her garage in Seattle.
