Daniel Doubrovkine bio photo

Daniel Doubrovkine

aka dB., @awscloud, former CTO @artsy, +@vestris, NYC

Email Twitter LinkedIn Github Strava
Creative Commons License

In June 2018 users started reporting slack-side disconnects in their Ruby bots, reported via slack-ruby-client#208, regardless of the concurrency library (Eventmachine or Celluloid) used. I was already observing this behavior in my larger production bots, especially playplay.io with hundreds of teams.

Because this was a new behavior I assumed that the client was not seeing or handling some new type of communication error correctly at the websocket library layer. This has been an issue before, and in the case of Celluloid, IO exceptions had to be handled by emitting a :close event here. That is, the websocket library was seeing a problem and raising an exception, but the client didn’t properly emit a close event, so the bots just sat there waiting for something to happen.

def run_loop
  loop { read }
rescue EOFError, Errno::ECONNRESET, Errno::EPIPE => e
  driver.emit(:close, WebSocket::Driver::CloseEvent.new(1001, 'server closed connection'))
end

The Faye::WebSocket::Client implementation has not seen a similar problem as it deferred reading data to WebSocket::Driver::StreamReader, which seemed to bubble up disconnects properly until these reports.

The server-side disconnects had started happening without any code changes, so naturally I assumed this was new behavior at Slack and opened a ticket on June 13. Slack quickly looped in @aoberoi from the SDKs team to help, but he found nothing wrong on the Slack side.

Bandaids

To relieve the immediate problem I stuck a restart into my bots (eg. slack-strava@bc29324) and began seeing about a dozen of these server-side disconnects every hour.

def cron
  once_and_every 60 * 60 do
    ping_teams!
  end
end

def ping_teams!
  Team.active.each do |team|
    begin
      ping = team.ping!
      next if ping[:presence].online
      logger.warn "DOWN: #{team}"
      after 60 do
        ping = team.ping!
        unless ping[:presence].online
          logger.info "RESTART: #{team}"
          SlackStrava::Service.instance.start!(team)
        end
      end
    rescue StandardError => e
      logger.warn "Error pinging team #{team}, #{e.message}."
    end
  end
end

Gross, but effective.

Async Implementation

Because both the asynchronous library and the websocket implementation were used by thousands of projects, I naturally assumed slack-ruby-client would be more likely to have a bug. The celluloid-io code that I wrote was not detecting server-side disconnects occasionally, therefore we must have not been using celluloid-io correctly in concurrency/celluloid.rb. I summarized my conclusion in celluloid-io#187 and asked for help.

It turned out that Celluloid::IO was effectively unmaintained and I was told to switch to async. While that was a viable idea, replacing the concurrency library could easily create many new bugs without fixing the underlying issue. I wanted to understand the root cause, first.

Websocket Ruby Driver

Interestingly, both Celluloid and EventMachine used websocket-ruby-driver and, since both concurrency libraries exhibited a similar problem, I opened websocket-driver-ruby#61 and asked for ideas. I got a really helpful response from @ioquatix, the author of async.

I didn’t realize the socket read could return nil, meaning something like a disconnect. A read of length 0 returned nil. You also needed to handle empty strings, which may have meant that the client was disconnected. That’s EPIPE, EOFError, ECONNRESET, nil and "".

Adding the code to handle all of the above in slack-ruby-client#216 solved nothing.

What else could go wrong?

There was some speculation that the bug was in Celluloid::IO::SSLSocket. It wasn’t.

Async-Websocket

Having no good ideas I ventured into replacing EventMachine and Celluloid with Async in slack-ruby-client#219 with a lot of help. That solved nothing, as expected, but replaced an unsupported library with a supported one.

Websocket-Level Ping Monitoring

We naturally reached the obvious conclusion that there was no bug and that the protocol required an active heartbeat, and that there was none implemented.

@RodneyU215 from Slack stepped in and added ping monitoring in slack-ruby-client#226 following Slack’s RTM guidelines (see Ping and Pong). The entire discussion and the evolution of that code is worth a read. I notably learned that Ruby Time.now was not monotonic.

I found the final code in slack-ruby-client#226 to be a rather elegant and straightforward example of a complex asynchronous process: a worker sends ping messages to Slack, which responds with a pong captured by the listening process. We periodically check whether a ping has been received within a reasonable timeframe and forcibly abort the connection, causing a reconnect, if it wasn’t.

We further refactored the implementation in slack-ruby-client#262, extracting and improving the keep-alive logic. Finally, slack-ruby-client#279 addressed handling of errors in the ping worker itself preventing it from dying.

Released Libraries

The active ping monitoring was released in slack-ruby-client 0.14.1, slack-ruby-bot 0.12.0 and slack-ruby-bot-server 0.9.0. Subsequent releases all the way up to 0.14.3 continue addressing various newly identified issues.

Thanks

Again, much thanks to Slack’s team for debugging, support and code, including Rodney Urquhart, Ankur Oberoi and Jason Roche, as well as Samuel Williams for his Async library, his adult supervision and for catching tons of non-obvious hard bugs across multiple code reviews.