Debugging 100% CPU usage in production Ruby on Rails systems

over 6 years ago

Today I noticed one of our customer containers was running really high on CPU.

# top -bn1 

190 discour+  20   0 2266292 205128  15656 S  86.7  0.3   9377:35 ruby

# ps aux

discour+   190 19.4  0.3 2266292 207096 ?      Sl    2017 9364:38 sidekiq 5.0.5 discourse [1 of 5 busy]

Looks like sidekiq is stuck on a job.

Where is it stuck?

Usually, this is where the story ends and another series of questions start

Can we reproduce this on staging or development?
What code changed recently?
Why is perf trace not giving me anything I can work with?
How awesome is my Sidekiq logging?
Where is my divining rod?

Julia Evans is working on a fantastic profiler that will allow us to answer this kind of question real quick. But in the meantime, is there anything we can do?

rbtrace + stackprof

At Discourse we include rbtrace and stackprof in our Gemfile.

gem 'rbtrace'
gem 'stackprof', require: false

We always load up rbtrace in production, it allows us a large variety of production level debugging. stackprof is loaded on-demand.

In this particular case I simply run:

# rbtrace -p 190 -e 'Thread.new{ require "stackprof"; StackProf.start(mode: :cpu); sleep 2; StackProf.stop; StackProf.results("/tmp/perf"); }'

This injects a new thread that enables stackprof globally and finally writes the performance data to /tmp/perf

This dump can easily be analyzed:

# stackprof /tmp/perf
==================================
  Mode: cpu(1000)
  Samples: 475 (0.63% miss rate)
  GC: 0 (0.00%)
==================================
     TOTAL    (pct)     SAMPLES    (pct)     FRAME
       473  (99.6%)         473  (99.6%)     #<Class:0x00007fac08218020>::Typography.to_html_french
         4   (0.8%)           1   (0.2%)     #<Module:0x00007fac080c3620>.reap_connections

In fact, we can even collect backtraces and generate flamegraphs in stackprof, so I strongly recommend reading through the readme.

So there you have it, my culprit is:

text.gsub(/(\s|)+([!?;]+(\s|\z))/, '&thinsp;\2\3')

A regular expression consuming 100% CPU in production, this has got to be a first

Posted by: Sam Permalink | Comments (13)

Comments

Sam Saffron about 6 years ago

Some people asked me what was the text that exploded the regex and how did I fix it:

A repro is here

gist.github.com

https://gist.github.com/SamSaffron/e2e1f7e1204c59292660cdfd0c2ea850

regex.rb

DATA.read.gsub(/(\s|)+([!?;]+(\s|\z))/, '&thinsp;\2\3')


__END__

GET /wp-login.php HTTP/1.1                                                          69 
GET /show.aspx HTTP/1.1                                                             15

For my part I removed the offending code for now. I am not even entirely sure what my customer was trying to do with the regex.

Kevin Vo about 6 years ago

Can you guys help me? Do you know what’s wrong?

hello_world$ rbtrace -p 14369 -e “Thread.new{ require ‘stackprof’; StackProf.start(mode: :cpu); sleep 2; StackProf.stop;”
Error: argument --pid (pid is not listening for messages, did you require "rbtrace").

Юрий Соколов about 6 years ago

Equivalent regexp is gsub(/\s*([!?;]+(\s|\z))/, ' \1\2') . I think it should be faster, though I didn’t check. By the way, way it double last capture (space or end)? It is already part of surrounding capture.

Sam Saffron about 6 years ago

My guess here is that you did not include rbtrace in your Gemfile. You got to add it and re-deploy.

Sam Saffron about 6 years ago

Thanks, honestly I am not sure why this was done in the first place. It is weird to change raw markdown that people enter into the composer and this regex was trying to do that. The correct way would be to enforce strange typography when you convert the raw markdown to cooked html.

Maciek about 6 years ago

Was the regex put there by a developer (in the code) or it was provided by a user (from a webpage and then loaded from the DB)?

Sam Saffron about 6 years ago

It was provided by a developer, it was part of a customer specific plugin.

Lukáš Zapletal about 6 years ago

Nice. Just for the record, on Red Hat systems you can use SystemTap too. Pretty neat:

You must use system or SCL Ruby tho, wont work with RVM/RBENV compiled (unpatched) versions.

Sam Saffron about 6 years ago

Hmmm is this just using the dtrace stuff? Can it not be compiled by hand somehow? What are they doing in SCL Ruby?

Lukáš Zapletal about 6 years ago

Yeah, SystemTap is essentialy dtrace clone used in Fedora and Red Hat clones and it is compatible with dtrace to some degree (I haven’t tried myself - it can somehow use dtrace points). Currently Ruby in Fedora and RHEL ships with a “tapset” file which enables the integration - there is no need of patching Ruby if I understand this correctly. It should work with upstream Ruby if you have SystemTap enabled kernel and tooling I believe.

In case you are interested, I’ve built a tool around SystemTap that allows you to have “top-like” experience for ad-hoc debug sessions. It’s just a shell wrapper around some examples from my blogpost: GitHub - lzap/foreman-tracer: Foreman SystemTap tracing utility

The stp file for current Fedora version is at https://src.fedoraproject.org/cgit/rpms/ruby.git/tree/libruby.stp

Edit: After some digging, I think that this tapset leverages dtrace hooks in Ruby upstream, thus no patches needed. Anyway the key thing is - SystemTap is present and enabled by default on Fedora, RHEL, CentOS and others. That’s super awesome. Not sure if you guys use non-redhat distros or custom kernel tho. I find this cool - no changes needed to investigate performance issues

Dan Benjamin about 6 years ago

Hey Sam, this is great.
When trying to run this locally I get the following error. Have any idea whats its about:

➜  ~ ps ux | grep ruby
user 99397   0.0  0.9  4446480 154392 s001  S+    1:40PM   0:10.81 /Users/user/.rbenv/versions/2.4.1/bin/ruby bin/rails c

➜  core git:(development) ✗ rbtrace -p 99397 -e 'Thread.new{ require "stackprof"; StackProf.start(mode: :cpu); sleep 2; StackProf.stop; StackProf.results("/tmp/perf"); }'
*** attached to process 99397
*** detached from process 99397
/Users/user/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/rbtrace-0.4.10/lib/rbtrace/rbtracer.rb:324:in `send_cmd': command is too long (ArgumentError)
	from /Users/user/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/rbtrace-0.4.10/lib/rbtrace/rbtracer.rb:146:in `eval'
	from /Users/user/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/rbtrace-0.4.10/lib/rbtrace/cli.rb:506:in `run'
	from /Users/user/.rbenv/versions/2.4.1/lib/ruby/gems/2.4.0/gems/rbtrace-0.4.10/bin/rbtrace:5:in `<top (required)>'
	from /Users/user/.rbenv/versions/2.4.1/bin/rbtrace:23:in `load'
	from /Users/user/.rbenv/versions/2.4.1/bin/rbtrace:23:in `<main>'

Sam Saffron about 6 years ago

We are super restricted on the amount of buffer we can pass to rbtrace (something I would like to fix) can you try squashing spaces there to see if it starts working?

'Thread.new{require "stackprof";StackProf.start(mode: :cpu);sleep 2;StackProf.stop;StackProf.results("/tmp/perf");}'

Yossi Shasho over 4 years ago

Actually this command was still too long for me. This is how I finally got it to work:

rbtrace -p PID -e 'Thread.new{require "stackprof";s = StackProf;s.start(mode: :cpu);sleep 2;s.stop;s.results("/tmp/perf");}'

Sam Saffron