Why puma workers constantly hung, and how we fixed it by discovering a bug in Ruby v2.5.8 and v2.6.6
While running Rails puma servers in production, we saw incidents where some long-lived worker processes suddenly got stuck even though the volume and pattern of incoming requests hadn't changed.
I found the root cause and reported it upstream. The issue still exists in the Ruby 2.6 series and can be found as far back as Ruby 2.5.0. If you just want a summary of the bug, see ruby-lang#17669.
What Happened
One day, we noticed many ActiveRecord::LockWaitTimeout
exceptions that specific users had been suffering from for a long time. At the same time, one of our puma workers reported that it had no available threads left, and its memory grew steadily without bound.
Here’s our data from this time period:
Initially, we were not confident that these observations were correlated or triggered by a single serious root cause. Each problem on its own is practically unavoidable under heavy traffic. Therefore, we let things run for a while and proceeded carefully and conservatively.
Immediate Response
First, to stop the memory overflow, I removed this application container from the load balancer by detaching it from the Kubernetes Service, changing the Pod's label so that it no longer matched the Service's selector. Once incoming requests stopped reaching it, the memory growth stopped too. So far, so good.
❯❯❯ kubectl edit pod api-5c476b576d-vfmgt
metadata:
  labels:
-   app: api
+   app: api-issue-memoryleak
On the flip side, the ActiveRecord::LockWaitTimeout
exceptions continued. Resolving them was urgent, since it had turned out to be a critical issue: the affected users could not use our main features until we worked it out.
Then I deleted the container on which the problematic worker process and its threads were running. Right off the bat, the exceptions ceased and everything was fine again 😌.
❯❯❯ kubectl delete pod api-5c476b576d-vfmgt
Observations
After the immediate fix, nothing important was left for investigating how this had happened, as we had lost our chance to capture informative snapshots such as the database's transaction locks and a core dump of the puma worker process. Incidentally, our puma servers used MySQL 5.7 as the primary database.
So I decided to capture them the next time a similar issue came up.
- Preparing fully tested instructions in advance helped me act as quickly as possible to fix this partial disruption. While you can find reliable web resources on how to view a transaction, the locks it holds, and the other transactions blocked by those locks, I had trouble generating a core dump and debugging a process running on a GKE cluster.
Then we waited for a recurrence, and it happened soon. Right out of the gate, I could detect long-running transactions by querying the information_schema.innodb_trx
table.
SELECT * FROM information_schema.innodb_trx;
Note that it showed two transactions in the RUNNING
state that had started over 14 minutes earlier. I noticed that each trx_query
was NULL. This column holds the SQL statement the transaction is currently executing, so the NULL values implied that some Ruby code had left these transactions uncommitted even though MySQL had already returned its results on time.
On top of that, you need to join the information_schema.innodb_lock_waits
table to tell whether these transactions are blocking other transactions that are trying to obtain the same locks.
- The INNODB_LOCK_WAITS table contains one or more rows for each blocked InnoDB transaction, indicating the lock it has requested and any locks that are blocking that request.
SELECT
r.trx_id waiting_trx_id,
r.trx_mysql_thread_id waiting_thread,
r.trx_query waiting_query,
b.trx_id blocking_trx_id,
b.trx_mysql_thread_id blocking_thread,
b.trx_query blocking_query
FROM information_schema.innodb_lock_waits w
INNER JOIN information_schema.innodb_trx b
ON b.trx_id = w.blocking_trx_id
INNER JOIN information_schema.innodb_trx r
ON r.trx_id = w.requesting_trx_id;
The result made me certain that these two transactions were blocking the other four transactions and triggering the ActiveRecord::LockWaitTimeout
exceptions. You can see that the values of blocking_trx_id
match the IDs of the long-running transactions.
Also, puma’s access logs written after it received a TERM
signal contained some requests with exceptionally long latency, which manifested as the requests being stuck.
Considering all the preceding observations, I concluded that the unexpected blocking of Rails requests was the root problem. The MySQL transactions issued along the way also lingered, blocking the others until the worker thread was terminated.
The location where the puma thread hung
To demystify the root cause, we needed to delve into a core dump to locate where the threads were stuck. Here’s how to generate a core dump on GKE.
- First, I logged in to the Compute Engine instance running that puma process over SSH.
❯❯❯ gcloud beta compute ssh --zone "asia-northeast1-a" "gke-cluster-n1-standard-8-po-a0cea807-5327" --project "hogehoge"
- Then, nsenter lets you execute gdb in the Pod's namespaces, specified by the PID of the worker process (3866) that the ps command showed. Running gdb in the container requires entering the mount, network, and PID namespaces of the target process.
$ ps auxf
$ sudo nsenter --mount --net --pid --target 3866 bash
- gcore generates a core file of a running program. Note that the worker is PID 3866 on the host but PID 23 inside the container's PID namespace.
bash-5.0# apk add gdb
bash-5.0# gcore 23
- Lastly, I copied the core file from the container to my local machine.
❯❯❯ kubectl cp api-5dcb4b66f8-99nft:/api/core.23 ~/react/tmp/core.23
Now to get the thread backtraces from the core dump, run these commands:
bash-5.0# apk add musl-dbg gdb curl
bash-5.0# gdb -silent -nw /usr/local/bin/ruby core.23
(gdb) info threads
Id Target Id Frame
...
6 LWP 32 "puma 001" __cp_end () at src/thread/x86_64/syscall_cp.s:29
7 LWP 33 "puma 002" __cp_end () at src/thread/x86_64/syscall_cp.s:29
8 LWP 35 "puma 003" __cp_end () at src/thread/x86_64/syscall_cp.s:29
9 LWP 36 "puma 004" __cp_end () at src/thread/x86_64/syscall_cp.s:29
10 LWP 37 "puma 005" __cp_end () at src/thread/x86_64/syscall_cp.s:29
11 LWP 39 "puma 006" __cp_end () at src/thread/x86_64/syscall_cp.s:29
12 LWP 40 "puma 007" __cp_end () at src/thread/x86_64/syscall_cp.s:29
13 LWP 41 "puma 008" __cp_end () at src/thread/x86_64/syscall_cp.s:29
14 LWP 42 "puma 009" __cp_end () at src/thread/x86_64/syscall_cp.s:29
15 LWP 43 "puma 010" __cp_end () at src/thread/x86_64/syscall_cp.s:29
16 LWP 44 "puma 011" __cp_end () at src/thread/x86_64/syscall_cp.s:29
17 LWP 45 "puma 012" __cp_end () at src/thread/x86_64/syscall_cp.s:29
18 LWP 46 "puma 013" __cp_end () at src/thread/x86_64/syscall_cp.s:29
19 LWP 47 "puma 014" __cp_end () at src/thread/x86_64/syscall_cp.s:29
20 LWP 48 "puma 015" __cp_end () at src/thread/x86_64/syscall_cp.s:29
...
We saw 15 puma threads processing requests. Here’s the first thread:
Note that this is a C-level backtrace. Although such a backtrace is helpful for guessing at something like a deadlock, it carried too little information to locate the root problem. The fact that we couldn't learn anything more from the C level suggested at this point that the Ruby code layer might be the culprit instead.
So, let’s get a Ruby-level backtrace. Before that, we need to set up a .gdbinit
file that defines some useful functions maintained in the official Ruby repository.
bash-5.0# apk add curl
bash-5.0# curl -o .gdbinit https://raw.githubusercontent.com/ruby/ruby/ruby_2_5/.gdbinit
bash-5.0# mv .gdbinit /root/
Calling rb_ps
dumps Ruby-level backtraces, whether gdb is attached to a live process or a core file.
(gdb) rb_ps
$1 = (rb_vm_t *) 0x7ff149353700
...
* #<Thread:0x56327568dfa0 rb_thread_t:0x56327646b020 native_thread:0x7fd3700dfb20>
147:0x7fd3739ca020 <rb_mutex_lock at thread_sync.c:326>:in `lock'
146:/usr/local/lib/ruby/2.5.0/monitor.rb:192:in `mon_enter'
145:/usr/local/lib/ruby/2.5.0/monitor.rb:233:in `block in mon_synchronize'
144:0x7fd3739c7df0 <rb_thread_s_handle_interrupt at vm_core.h:1622>:in `handle_interrupt'
143:/usr/local/lib/ruby/2.5.0/monitor.rb:233:in `mon_synchronize'
142:/usr/local/lib/ruby/2.5.0/logger.rb:688:in `write'
141:/usr/local/lib/ruby/2.5.0/logger.rb:471:in `add'
140:/cocalero-api/vendor/bundle/ruby/2.5.0/gems/activesupport-6.0.1/lib/active_support/logger_thread_safe_level.rb:53:in `add'
139:/usr/local/lib/ruby/2.5.0/logger.rb:527:in `info'
138:/cocalero-api/vendor/bundle/ruby/2.5.0/gems/activesupport-6.0.1/lib/active_support/log_subscriber.rb:119:in `info'
137:/cocalero-api/vendor/bundle/ruby/2.5.0/gems/activejob-6.0.1/lib/active_job/logging.rb:57:in `enqueue'
...
Now we had found where the threads were hanging: all of the hung threads were waiting for the monitor's mutex lock with exactly the same backtrace.
For some reason, this lock was never released. I expected the thread list to contain a problematic thread holding the lock, but I couldn't find one. In short, gdb showed us where the threads hung, but not the exact thread that caused the situation.
How to mitigate the problem
According to the Ruby-level backtrace, calling Rails.logger
during a MySQL transaction can trigger ActiveRecord::LockWaitTimeout
exceptions, because the blocked threads keep holding their MySQL locks. The logger's write
method is built on Monitor#synchronize
to protect against race conditions around the output device.
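For context, the log device's write looks roughly like this (a simplified paraphrase of Ruby 2.5's logger.rb, not the exact code):
# Simplified paraphrase of Logger::LogDevice#write in Ruby 2.5's logger.rb.
# LogDevice includes MonitorMixin, so every log line written by every
# thread in the process funnels through this single monitor.
class Logger
  class LogDevice
    include MonitorMixin

    def write(message)
      synchronize do        # Monitor#synchronize, i.e. mon_synchronize
        @dev.write(message) # the actual I/O to the output device
      end
    end
  end
end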
It then turned out that we had been enqueuing a Sidekiq job from an ActiveRecord callback inside the transaction, and this enqueuing entails logging. Fundamentally, performing extra I/O inside a transaction is an unsafe operation.
To mitigate the problem for now, we changed the code to enqueue the job after the database transaction commits.
- after_destroy :delete_session, :enqueue_notification_job
+ after_destroy :delete_session
+ after_destroy_commit :enqueue_notification_job
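For reference, after_destroy_commit is Rails shorthand for after_commit(..., on: :destroy), so the callback fires only once the surrounding transaction has committed. A minimal sketch of the resulting model (the model and job class here are hypothetical, echoing the diff above):
class Session < ApplicationRecord # hypothetical model
  # Runs inside the destroy transaction; fine, as it only touches the DB.
  after_destroy :delete_session

  # Equivalent to `after_commit :enqueue_notification_job, on: :destroy`.
  # The Sidekiq push, and the logging it entails, now happens only after
  # MySQL has committed and released the row locks.
  after_destroy_commit :enqueue_notification_job

  private

  def enqueue_notification_job
    NotificationJob.perform_later(id) # hypothetical job class
  end
end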
Now, this mitigation prevented serious errors like the ActiveRecord::LockWaitTimeout
exceptions 🎉. Some puma workers still ran out of available threads, but it bought me time to dive into a further investigation without rushing.
What’s the root cause
I had suspected Timeout.timeout
as a tricky mechanism that can break internal state which must stay consistent in the first place. In fact, this incident began on the very day we introduced an additional call to Timeout.timeout
.
- Note that Timeout.timeout runs the given block in the calling thread while a watchdog thread waits; if the block hasn't finished within the time limit, the watchdog forcefully interrupts the calling thread by raising an exception into it (a sketch of this mechanism follows the example below).
- As many resources have warned, this behavior can inherently corrupt application-level state. However, Ruby provides a way to implement a critical section that must be mutually exclusive: the Monitor class we saw in the backtrace. Almost all reliable libraries rely on this class and protect their invariants properly. Nevertheless, we still got a deadlock, and that is the tricky problem we tried to wrap our heads around at the time.
require 'monitor'
lock = Monitor.new
lock.synchronize do
# exclusive access
end
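For reference, Timeout.timeout in these Ruby versions is implemented along these lines (a heavily simplified sketch of lib/timeout.rb, not the exact code):
require 'timeout'

# Heavily simplified sketch of Timeout.timeout. The block runs in the
# *calling* thread; a watchdog thread sleeps for `sec` and then raises
# Timeout::Error into the caller asynchronously, at whatever line it
# happens to be executing at that instant.
def timeout_sketch(sec)
  caller_thread = Thread.current
  watchdog = Thread.start do
    sleep sec
    caller_thread.raise Timeout::Error, 'execution expired'
  end
  yield
ensure
  watchdog.kill
  watchdog.join
end
The danger is precisely that asynchronous raise: it can land between any two lines of the block, including inside library code such as monitor.rb.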
Although I was positive that easy workarounds such as lengthening the timeout or removing the call altogether would resolve it, I didn't find that interesting, so I spent time on the investigation as a learning experience instead.
That said, the backtrace suggested the possibility that a thread that had acquired the mutex was terminated abruptly because it failed to finish within the timeout. Indeed, a Timeout::Error
exception carries a backtrace showing where the block was forcefully abandoned. We added a line to print a new error log:
rescue Timeout::Error => e
+ Rails.logger.error("timeout occured.\n#{e.message}\n#{e.backtrace.join("\n")}")
end
Here’s an excerpt of e.backtrace:
timeout occured.
/usr/local/lib/ruby/2.5.0/monitor.rb:233:in `handle_interrupt'
/usr/local/lib/ruby/2.5.0/monitor.rb:233:in `mon_synchronize'
/usr/local/lib/ruby/2.5.0/logger.rb:688:in `write'
/usr/local/lib/ruby/2.5.0/logger.rb:471:in `add'
/cocalero-api/vendor/bundle/ruby/2.5.0/gems/activesupport-6.0.1/lib/active_support/logger_thread_safe_level.rb:53:in `add'
/usr/local/lib/ruby/2.5.0/logger.rb:527:in `info'
Finally, we found the root cause. If Timeout::Error
is raised while the thread is executing at the point just before the begin
block in mon_synchronize, it breaks the monitor's state and causes a deadlock, even though a comment in the source code claims this cannot happen. In other words, a thread may never call mon_exit
after having executed mon_enter
, which violates the invariant that a thread always unlocks the monitor it locked.
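For illustration, mon_synchronize is structured along these lines (simplified; the real code in monitor.rb also wraps parts of it in Thread.handle_interrupt, but the vulnerable window remains):
# Simplified sketch of MonitorMixin#mon_synchronize in Ruby 2.5/2.6.
def mon_synchronize
  mon_enter        # the monitor is now locked by this thread
                   # <-- if Timeout::Error arrives exactly here, the
                   #     begin/ensure below never takes effect ...
  begin
    yield
  ensure
    mon_exit       # ... so this is skipped and the lock leaks
  end
end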
Here’s the way to reproduce it:
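The exact script is in the upstream report; the following is a minimal sketch along the same lines (the iteration count and sleep durations are arbitrary):
require 'monitor'
require 'timeout'

lock = Monitor.new

# Hammer Monitor#synchronize with tightly-timed Timeout calls. On the
# buggy versions, Timeout::Error occasionally fires in the window between
# mon_enter and the begin block inside mon_synchronize, so mon_exit never
# runs and the monitor is left locked.
100_000.times do |i|
  begin
    Timeout.timeout(0.001) do
      lock.synchronize { sleep(rand / 1000) }
    end
  rescue Timeout::Error
    # After an aborted synchronize, this thread must no longer own the
    # monitor. If it still does, the lock has leaked.
    raise "monitor leaked at iteration #{i}" if lock.mon_owned?
  end
end
puts 'no leak observed'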
I also confirmed that it was never reproduced again after applying the below patch to Ruby's monitor.rb
file.
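The exact patch is attached to the upstream report; the sketch below only captures the idea, which is to keep asynchronous interrupts masked across the whole enter/begin sequence so that no unprotected gap remains:
# Sketch of the idea behind the fix (not the exact patch): mask
# asynchronous interrupts across mon_enter and the begin/ensure setup,
# re-enabling them only while the caller's block runs. The lock is then
# never held without the ensure being armed.
def mon_synchronize
  Thread.handle_interrupt(Exception => :never) do
    mon_enter
    begin
      Thread.handle_interrupt(Exception => :immediate) { yield }
    ensure
      mon_exit
    end
  end
end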
How to avoid this bug
Updating your Ruby to either 2.7.2 or 3.0.0 fixes this issue completely, because Monitor was reimplemented in C in those versions.
Otherwise, you can manually unlock the monitor in an unsophisticated way when a Timeout::Error
occurs, though this may be too ad hoc to adopt in production. Here's the sample code:
rescue Timeout::Error => e
  # Reach into the logger's log device, which includes MonitorMixin.
  logdev = Rails.logger.instance_variable_get(:@logdev)
  if logdev.mon_owned?
    # The interrupted thread still owns the monitor; release it manually.
    logdev.mon_exit
  end
end
Wrapping up
Ruby v2.5.8 and v2.6.6 have a bug that causes a deadlock triggered by Timeout.timeout
. As of this writing (2021-04-04), the v2.6 series has not yet been fixed upstream.
I identified this bug by collecting a core dump, reading the source code of both Ruby and our application, and analyzing backtraces with GDB.