Fastest way to profile a method in Ruby

over 6 years ago

Lately I needed to add some instruments into Discourse. I wanted every request to measure:

How many SQL statements and Redis commands are executed?
How much time was spent in those sections?

This can easily be abstracted to a more generalize problem of

class ExternalClass
   def method_to_measure
   end
end

Solving this problem breaks down into a few particular sub-problems

What is the fastest way of measuring elapsed time in a method?
What is the fastest patch to a method that can add instrumentation?
How does one account for multithreading?

Fastest way to measure a method

The first and simplest problem to solve is pretending we do not have to patch anything.

If we had access to all the source and could add all the instruments we wanted to, what is the quickest way of measuring elapsed time?

There are two techniques we can use in Ruby.

Create a Time object before and after and delta.
Use Process.clock_gettime to return a Float before and after and delta.

Let’s measure:

require 'benchmark/ips'

class Test
  attr_reader :count, :duration

  def initialize
    @count = 0
    @duration = 0.0
  end

  def work
  end

  def method
    work
  end

  def time_method
    @count += 1
    t = Time.now
    work
  ensure
    @duration += Time.now - t
  end

  def process_clock_get_time_method
    @count += 1
    t = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    work
  ensure
    @duration += Process.clock_gettime(Process::CLOCK_MONOTONIC) - t
  end
end


t = Test.new

Benchmark.ips do |b|
  b.report "method" do |times|
    i = 0
    while i < times
      t.method
      i += 1
    end
  end

  b.report "time_method" do |times|
    i = 0
    while i < times
      t.time_method
      i += 1
    end
  end

  b.report "process_clock_get_time_method" do |times|
    i = 0
    while i < times
      t.process_clock_get_time_method
      i += 1
    end
  end

end

# Calculating -------------------------------------
#              method     19.623M (± 3.5%) i/s -     98.227M in   5.012204s
#         time_method      1.596M (± 1.2%) i/s -      8.061M in   5.050321s
# process_clock_get_time_method
#                          4.972M (± 1.7%) i/s -     24.908M in   5.011634s

As expected Process.clock_gettime(Process::CLOCK_MONOTONIC) is fastest cause no ruby object is allocated and later garbage collected, it also happens to be the most accurate way of measuring duration.

So let’s use that from now.

A note on “not doing work”

Our method above was doing no work, if we start doing work, the difference will be much less observable.

For example, if I replace work with $redis.get('x') which is pretty much the shortest time I want to instrument I will get:

              method     17.567k (± 3.0%) i/s -     88.230k in   5.027068s
         time_method     16.730k (± 3.2%) i/s -     84.864k in   5.077868s
process_clock_get_time_method
                         17.112k (± 2.7%) i/s -     85.884k in   5.022470s

So we have 17.5 vs 17.11 vs 16.7 thousand redis gets a second. Fastest being uninstrumented, slowest being using Time.now.

This is not a HUGE difference like the original benchmark, but you can not argue that Time.now is a tax you want to pay when an faster and safer alternative exists.

Fastest way to “patch” a method

When instrumenting code we do not own, we do not have the luxury of making source level changes.

This leaves us with 3 alternatives:

Using prepend to prepend a module in the class and override the original method
Using alias_method and chaining our new method around the original
Replacing the method with a “patched” version of the method

Replacing the method is fraught with risk, cause once the original method in a 3rd party library changes we need to fix our replacement. It is unmaintainable long term. An interesting option that may pop up later is outputting the instruction sequence using RubyVM::InstructionSequence#of , patching the instruction sequence and then loading up the method. This will clearly be the fastest way of patching, but is not quite available yet cause instruction sequence manipulation is still a bit of a dark art.

So let’s compare prepend with alias_method, what is going to be faster?

require 'benchmark/ips'
#require 'redis'

$redis = Redis.new

class Test
  attr_reader :count

  def initialize
    @count = 0
  end

  def work
  #  $redis.get 'X'
  end

  def method
    work
  end

  def method_with_prepend
    work
  end

  def method_with_alias_method
    work
  end

  def method_with_manual_instrument
    @count += 1
    work
  end
end

module PrependInstrument
  def method_with_prepend
    @count += 1
    super
  end
end

class Test; prepend PrependInstrument; end

class Test
  alias_method :method_with_alias_method_orig, :method_with_alias_method

  def method_with_alias_method
    @count += 1
    method_with_alias_method_orig
  end
end

t = Test.new

Benchmark.ips do |b|
  b.report "method" do |times|
    i = 0
    while i < times
      t.method
      i += 1
    end
  end

  b.report "method_with_prepend" do |times|
    i = 0
    while i < times
      t.method_with_prepend
      i += 1
    end
  end

  b.report "method_with_alias_method" do |times|
    i = 0
    while i < times
      t.method_with_alias_method
      i += 1
    end
  end

  b.report "method_with_manual_instrument" do |times|
    i = 0
    while i < times
      t.method_with_manual_instrument
      i += 1
    end
  end
end


#               method     20.403M (± 1.6%) i/s -    102.084M in   5.004684s
#  method_with_prepend     10.339M (± 1.5%) i/s -     51.777M in   5.009321s
# method_with_alias_method
#                         13.067M (± 1.8%) i/s -     65.649M in   5.025786s
# method_with_manual_instrument
#                         16.581M (± 1.6%) i/s -     83.145M in   5.015797s

So it looks like the very elegant prepend method of patching methods is slower than the old school alias method trick.

However once you introduce work the difference is far more marginal, this is the same bench with $redis.get "x" as the unit of work

              method     17.259k (± 3.1%) i/s -     86.892k in   5.039493s
 method_with_prepend     17.056k (± 2.8%) i/s -     85.782k in   5.033548s
method_with_alias_method
                         17.464k (± 2.4%) i/s -     87.464k in   5.011238s
method_with_manual_instrument
                         17.369k (± 2.8%) i/s -     87.204k in   5.024699s

Performance is so close it is hard to notice anything.

Accounting for threading

One big real world concern I had for my use case is ensuring I only measure work from my current thread.

In Discourse some background threads could be running doing deferred work and I do not want that to interfere with the measurements on the thread servicing the web. In a Puma based environment this is supercritical cause lots of threads could be accessing the same global thread safe Redis connection, you may also want to “enable” instrumenting prior to having a connection open.

So we can start by measuring the impact of thread safety.

require 'benchmark/ips'
require 'redis'

$redis = Redis.new

class Test
  attr_reader :count

  def initialize
    @count = 0
    @thread = Thread.current

    data = Struct.new(:count).new
    data.count = 0

    Thread.current["prof"] = data
  end

  def work
    # $redis.get 'X'
  end

  def method
    work
  end

  def method_with_safety
    @count += 1 if Thread.current == @thread
    work
  end

  def method_with_thread_storage
    if data = Thread.current["count"]
      data.count += 1
    end
    work
  end

end


t = Test.new

Benchmark.ips do |b|
  b.report "method" do |times|
    i = 0
    while i < times
      t.method
      i += 1
    end
  end

  b.report "method_with_safety" do |times|
    i = 0
    while i < times
      t.method_with_safety
      i += 1
    end
  end

  b.report "method_with_thread_storage" do |times|
    i = 0
    while i < times
      t.method_with_thread_storage
      i += 1
    end
  end

end

There are 2 mechanisms we can account for thread safety:

Only run instrumentation if Thread.current is the thread being profiled
Store instrumentation data on Thread.current (note an external registry is no faster)

We can see the faster method is using a Thread.current comparison. However, once you start doing work the difference is quite marginal. Safety comes at a cost but it is not a ridiculous cost.

### NO WORK
              method     20.119M (± 3.1%) i/s -    100.579M in   5.004172s
  method_with_safety     11.254M (± 3.0%) i/s -     56.509M in   5.025966s
method_with_thread_storage
                          6.105M (± 3.5%) i/s -     30.559M in   5.011887s

With $redis.get 'x'

              method     18.007k (± 4.8%) i/s -     91.052k in   5.069131s
  method_with_safety     18.081k (± 2.9%) i/s -     91.156k in   5.045869s
method_with_thread_storage
                         18.042k (± 2.8%) i/s -     90.532k in   5.021889s

A generic method profiler

So now, we can piece all of the bits together use the fastest micro benches and build up a general method profiler we can use to gather metrics.

# called once to patch in instrumentation
MethodProfiler.patch(SomeClass, [:method1, :method2], :name)

SomeClass.new.method1
SomeClass.new.method2

# called when starting profiling
MethodProfiler.start
result = MethodProfiler.stop

{ 
   total_duration: 0.2,
   name: {duration: 0.111, calls: 12}
}

Which can be implemented with:

require 'benchmark/ips'
require 'redis'
class MethodProfiler
  def self.patch(klass, methods, name)
    patches = methods.map do |method_name|
      <<~RUBY
      unless defined?(#{method_name}__mp_unpatched)
        alias_method :#{method_name}__mp_unpatched, :#{method_name}
        def #{method_name}(*args, &blk)
          unless prof = Thread.current[:_method_profiler]
            return #{method_name}__mp_unpatched(*args, &blk)
          end
          begin
            start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
            #{method_name}__mp_unpatched(*args, &blk)
          ensure
            data = (prof[:#{name}] ||= {duration: 0.0, calls: 0})
            data[:duration] += Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
            data[:calls] += 1
          end
        end
      end
      RUBY
    end.join("\n")

    klass.class_eval patches
  end

  def self.start
    Thread.current[:_method_profiler] = {
      __start: Process.clock_gettime(Process::CLOCK_MONOTONIC)
    }
  end

  def self.stop
    finish = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    if data = Thread.current[:_method_profiler]
      Thread.current[:_method_profiler] = nil
      start = data.delete(:__start)
      data[:total_duration] = finish - start
      data
    end
  end
end

$redis = Redis.new

class Test
  def work
    #$redis.get 'X'
  end

  def method
    work
  end

  def method_unpatched
    work
  end
end


MethodProfiler.patch(Test, [:method], :a_thing)
MethodProfiler.start

t = Test.new

Benchmark.ips do |b|
  b.report "method" do |times|
    i = 0
    while i < times
      t.method
      i += 1
    end
  end

  b.report "method_unpatched" do |times|
    i = 0
    while i < times
      t.method_unpatched
      i += 1
    end
  end

end

Link to Gist

Performance wish there is a definite impact for the “no work scenario”

(while profiling)

              method      2.247M (± 1.6%) i/s -     11.370M in   5.061768s
    method_unpatched     20.261M (± 3.5%) i/s -    101.226M in   5.002638s

(without profiling)

              method      6.015M (± 2.9%) i/s -     30.156M in   5.017655s
    method_unpatched     20.211M (± 2.1%) i/s -    101.288M in   5.013976s

However once we actually start doing work with say $redis.get 'x'

(while profiling)
              method     17.419k (± 3.5%) i/s -     87.108k in   5.006821s
    method_unpatched     17.627k (± 7.3%) i/s -     88.740k in   5.063757s


(without profiling)

              method     17.293k (± 3.1%) i/s -     87.672k in   5.074642s
    method_unpatched     17.478k (± 2.4%) i/s -     88.556k in   5.069598s

Conclusion

Building on all the micro benchmarks we can design a generic class for profiling methods quite efficiently in Ruby with very minimal runtime impact.

Feel free to borrow any of the code in this article with or without attribution and build upon it.

Posted by: Sam Permalink | Comments (1)

Comments

tayler over 6 years ago

When I’m able to run a code, I profile it like this

many = 100_000
start = Time.now
many.times do
  t.method
end
puts Time.now - start

Sam Saffron