Debugging MySQL lock errors in Ruby on Rails

New week, new bug to hunt down! A day after our last deployment, we encountered a new type of error that started occurring randomly. Certain SQL update queries were failing with the following error:

ActiveRecord::StatementInvalid: Mysql2::Error: Lock wait timeout exceeded; try restarting transaction: UPDATE users SET ...

At first glance, it seemed like something had locked our DB for longer than MySQL’s default lock wait timeout of 50 seconds.
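
If you want to double-check what your server actually uses, the value can be read from a Rails console. A quick sketch; the setting is MySQL’s innodb_lock_wait_timeout variable:

ActiveRecord::Base.connection.select_value('SELECT @@innodb_lock_wait_timeout')
# => 50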

It was strange, since we hadn’t recently deployed anything that could have caused such behaviour.

Nevertheless, the situation was getting critical, so I had to address it as a top priority.

The theory

To ensure data consistency, MySQL applies a set of complex rules for locking. It’s possible for more rows to be locked than one might expect, especially when there is a lack of relevant database indexes.

For example, the following Rails code performs an update that locks the entire table if the name field doesn’t have an index:

User.where(name: '...').update_all(...)
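
A missing index like that is usually a quick fix. Here is a minimal migration sketch, assuming a users.name column as in the example above (adjust the migration version to your Rails release):

class AddIndexToUsersOnName < ActiveRecord::Migration[6.0]
  def change
    # With an index, InnoDB only needs to lock the matching index entries
    # instead of every row it scans.
    add_index :users, :name
  end
end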

Explicit database transactions in Rails, like SomeModel.transaction {...}, are also potential triggers for database locks.
Records updated in a transaction block stay locked until the transaction ends, and if another part of the application attempts to update the same locked data, MySQL waits for up to 50 seconds before throwing a lock wait timeout error.
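
A contrived sketch of that scenario (the record IDs and values are made up):

# Process A: the UPDATE takes an exclusive row lock that is held until the
# surrounding transaction commits or rolls back.
User.transaction do
  User.find(1).update!(name: 'Alice')
  sleep 60 # anything slow in here keeps the lock held
end

# Process B, running at the same time, tries to update the same row. It waits
# for up to 50 seconds and then fails with
# "Lock wait timeout exceeded; try restarting transaction".
User.find(1).update!(name: 'Bob')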

I had a hunch that our legacy application was performing a significant amount of work within DB transactions, such as making HTTP calls to other services. If just one of those services took longer than expected, it could very well be the cause of our issue.

The practice

MySQL has a Lock Monitor that can provide valuable insight into the cause of a lock. However, it needs to be enabled or extracted somehow, and unfortunately, it wasn’t available in our case.
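
For reference, on setups where the database user has the PROCESS privilege, the standard InnoDB monitor output (including recent lock waits and the latest detected deadlock) can be pulled ad hoc; a sketch via ActiveRecord:

status = ActiveRecord::Base.connection.select_one('SHOW ENGINE INNODB STATUS')
puts status['Status']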

Since waiting for the DevOps team to turn it on wasn’t an option, I decided to do some debugging on my own.

First, I measured the runtime of our transactions and logged the results:

# config/initializers/monkey_patches.rb
module CoreExtension
  module ActiveRecord
    module Base
      def transaction(*args)
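        # Build a backtrace with gem frames stripped out, so the log only shows
        # where in the application code the transaction was opened.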
        backtrace_cleaner = ActiveSupport::BacktraceCleaner.new

        filters = [
          'gems'
        ]

        backtrace_cleaner.add_silencer do |line|
          filters.any? { |filter| line.include?(filter) }
        end

        clean_backtrace = backtrace_cleaner.clean(caller)

        transaction_id = SecureRandom.uuid

        Rails.logger.info("[DBTransaction] start ##{transaction_id} #{clean_backtrace.to_json}")
        start = Time.now.to_i

        super
      ensure
        elapsed_seconds = Time.now.to_i - start

        Rails.logger.info("[DBTransaction] finished in more than 50s ##{transaction_id}: #{elapsed_seconds}s #{clean_backtrace.to_json}") if elapsed_seconds > 50
      end
    end
  end
end

ActiveRecord::Base.singleton_class.prepend CoreExtension::ActiveRecord::Base

This monkey patch wraps every ActiveRecord transaction with logging.

We log the start of every transaction with a filtered call stack, and the end of any transaction that ran longer than the 50-second lock timeout, along with the elapsed seconds. Since transactions can be nested, we generate a unique ID for each one so the end can be matched with the corresponding start.

Then we can grep the log for long-running transactions:

grep -F '[DBTransaction] finished in more' log/production.log

In our case, a log entry may look like this:

[timestamp][process_id][DBTransaction] finished in more than 50s #transaction-id-xyz: 91s ...

This gives us the transaction ID, which we can use to find the start and end entries of the transaction in the log.

To list all the log entries between the start and the end, we can use the sed command:

sed -n '/start #transaction-id-xyz/,/#transaction-id-xyz/p' log/production.log

If multiple processes are feeding the same log stream, we need to filter for the corresponding process. In the example above, you can see that we include the process ID in the log entries as [process_id] after the timestamp. Let’s leverage this information along with grep to build the final command:

sed -n '/start #transaction-id-xyz/,/#transaction-id-xyz/p' log/production.log | grep -F '[process_id]'

The solution

After a thorough investigation, I discovered that the cause of the long-running transaction was an HTTP call.

It turned out that the service was experiencing some issues, resulting in a response time of over a minute.

To address this, I quickly moved the problematic HTTP call into a background job and notified our DevOps team.

Now, the call will be processed asynchronously, allowing the transaction to commit without waiting for the response from the service.
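
The exact service and job names differ, but the shape of the fix looks roughly like this: a hypothetical job performs the HTTP call on its own, outside of any database transaction.

# app/jobs/notify_external_service_job.rb (hypothetical names)
class NotifyExternalServiceJob < ApplicationJob
  queue_as :default

  def perform(user_id)
    user = User.find(user_id)
    # The slow HTTP call now happens here, so it no longer holds row locks
    # while waiting for the response.
    ExternalService.notify(user)
  end
end

The transaction itself is left with fast database work only, and the job is enqueued after it commits:

User.transaction do
  user.update!(status: 'active') # hypothetical attribute
end
NotifyExternalServiceJob.perform_later(user.id)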

Happy debugging!
