We Made Puma Faster With Sleep Sort

by Nate Berkopec (@nateberkopec) of (who?), a Rails performance consultancy.

Summary: Puma 5 is a huge major release for the project. It brings several new experimental performance features, along with tons of bugfixes and features. Let's talk about some of the most important ones. (1839 words / 7 minutes)

Puma 5 (codename Spoony Bard¹(When Puma gets a new ‘supercontributor’ that submits lots of important work to the project, we let them name the next release. This release features a lot of code from Will Jordan, who named this release ‘Spoony Bard’. Will said: ‘Final Fantasy IV is especially nostalgic for me, the first big open-source project I ever worked on was a fan re-translation of the game back in the late 90s.’)¹ When Puma gets a new ‘supercontributor’ that submits lots of important work to the project, we let them name the next release. This release features a lot of code from Will Jordan, who named this release ‘Spoony Bard’. Will said: ‘Final Fantasy IV is especially nostalgic for me, the first big open-source project I ever worked on was a fan re-translation of the game back in the late 90s.’) was released today (my birthday!). There’s a lot going on in this release, so I wanted to talk about the different features and changes to give Puma users confidence in upgrading.

Experimental Performance Features For Cluster Mode on MRI

This is probably the headline of the release - two features for reducing memory usage, and one for reducing latency.

Puma 5 contains 3 new experimental performance features:

wait_for_less_busy_worker config. This may reduce latency on MRI through inserting a small delay (sleep sort!) before re-listening on the socket if worker is busy. Intended result: If enabled, should reduce latency in high-load (>50% utilization) Puma clusters.
fork_worker option and refork command for reduced memory usage by forking from a worker process instead of the master process. Intended result: If enabled, should reduce memory usage.
Added nakayoshi_fork config option. Reduce memory usage in preloaded cluster-mode apps by GCing before fork and compacting, where available. Intended result: If enabled, should reduce memory usage.

All of these experiments are only for cluster mode Puma configs running on MRI.

We’re calling them experimental because we’re not sure if they’ll actually have any benefit. We’re pretty sure they’re stable and won’t break anything, but we’re not sure they’re actually going to have big benefits in the real world. People’s workloads are often not what we anticipate, and synthetic benchmarks are usually not of any help in figuring out if a change will be beneficial or not.

We do not believe any of the new features will have a negative effect or impact the stability of your application. This is either a “it works” or “it does nothing” experiment.

If any of the features turn out to be particularly beneficial, we may make them defaults in future versions of Puma.

If you upgrade and try any of the 3 new features, please post before and after results or screenshots to this Github issue. “It didn’t do anything” is still a useful report in this case. Posting ~24 hours of “before” and ~24 hours of “after” data would be most helpful.

wait_for_less_busy_worker: sleep sort for faster apps?!

This feature was contributed to Puma by Gitlab. Turn it on by adding wait_for_less_busy_worker to your Puma config.

When a request comes in to a Puma cluster, the operating system randomly selects a listening, free Puma worker process to pick up the request. “Listening” and “free” being the key words - a Puma process will only listen to the socket (and pick up more requests) if it has nothing else to do. However, when running Puma with multiple threads, Puma will also listen on the socket when all of its busy threads are waiting on I/O or have otherwise released the Global VM Lock.

When Gitlab investigated switching from Unicorn to Puma, they encountered an issue with this behavior. Under high load with moderate thread settings (a max pool size of 5 in their case), average request latency increased. Why?

Remember, I said that the operating system randomly assigns a request to a listening worker process. So, it will never send a request to a worker process that’s busy doing other things, but what about a worker process that’s got 4 threads that are processing other requests, but all 4 of those threads happen to be waiting on I/O right now?

Imagine a Puma cluster with 3 workers:

Worker 1: 0/5 threads busy.
Worker 2: 1/5 threads busy.
Worker 3: 4/5 threads busy.

If Worker 3’s 4 active threads happen to all have released the GVL, allowing that worker to listen to the socket, and a new request comes in - which worker process should we assign the request to, ideally? Worker 1, right? Unfortunately, most operating systems will assign the request to Worker 3 33% of the time.

So, what do we do? We want the operating system to prefer less-loaded workers. It would be really cool if we could sort the list of workers listening on the socket so that the operating system would give requests to the least-loaded worker. Well, we can’t really do that easily, but we can do something else.

wait_for_less_busy_worker causes a worker to wait to re-listen on the socket if it’s thread pool isn’t completely empty. This means that in high-load scenarios, the operating system will assign requests to less-loaded workers.

This is basically sleep-sorting our workers. We’re kind of doing doing this:

[].tap { |a| workers.map { |e| Thread.new{ sleep worker_busyness.to_f/1000; a << e} }.each{|t| t.join} }

… and hiding “more loaded” workers from the operating system by letting less-loaded workers listen first!

Originally the proposal was for a more complicated sort - processes slept longer if they had more busy threads - but that was removed when it was found that a simpler on/off sleep was just as effective.

The net effect is that in high-load scenarios, request latency decreases. This is because workers with more busy threads are slower than workers with no busy threads. We’re assuring that requests get assigned to the faster workers. Prior to this patch, Gitlab saw an increase in latency using Puma compared to Unicorn - after this patch, latency was the same (they also were able to reduce their fleet size by almost 30% thanks to Puma’s memory-saving multithreaded design).

There may be even more efficient ways for us to implement this behavior in the future. There’s some magic you can do with libev, I’m pretty sure, or we can just implement a different sleep/wait strategy.

fork_worker

Adding fork_worker to your puma.rb config file (or --fork-worker from the CLI) turns on this feature. This mode causes Puma to fork additional workers from worker 0, instead of directly from the master process:

 \_ puma 5.0.0 (tcp://0.0.0.0:9292) [puma]
     \_ puma: cluster worker 0: 10000 [puma]
         \_ puma: cluster worker 1: 10000 [puma]
         \_ puma: cluster worker 2: 10000 [puma]
         \_ puma: cluster worker 3: 10000 [puma]

Similar to the preload_app! option, the fork_worker option allows your application to be initialized only once for copy-on-write memory savings, and it has two additional advantages:

Compatible with phased restart. Because the master process itself doesn’t preload the application, this mode works with phased restart (SIGUSR1 or pumactl phased-restart), unlike preload_app!. When worker 0 reloads as part of a phased restart, it initializes a new copy of your application first, then the other workers reload by forking from this new worker already containing the new preloaded application.

This allows a phased restart to complete as quickly as a hot restart (SIGUSR2 or pumactl restart), while still minimizing downtime by staggering the restart across cluster workers.

‘Refork’ for additional copy-on-write improvements in running applications. Fork-worker mode introduces a new refork command that re-loads all nonzero workers by re-forking them from worker 0.

This command can potentially improve memory utilization in large or complex applications that don’t fully pre-initialize on startup, because the re-forked workers can share copy-on-write memory with a worker that has been running for a while and serving requests.

You can trigger a refork by sending the cluster the SIGURG signal or running the pumactl refork command at any time. A refork will also automatically trigger once, after a certain number of requests have been processed by worker 0 (default 1000). To configure the number of requests before the auto-refork, pass a positive integer argument to fork_worker (e.g., fork_worker 1000), or 0 to disable.

nakayoshi_fork

Add nakayoshi_fork to your puma.rb config to try this option.

Nakayoshi means “friendly”, so this is a “friendly fork”. The concept was originally implemented by MRI supercontributor Koichi Sasada in a gem, but we wanted to see if we could bring a simpler version into Puma.

Basically, we just do the following before forking a worker:

4.times { GC.start }
GC.compact # if available

The concept here is that we’re trying to get as clean of a Ruby heap as possible before forking to maximize copy-on-write benefits. That should, in turn, lead to reduced memory usage.

Other New Features

A few more things in the grab-bag:

You can now compile Puma on machines where OpenSSL is not installed.
There is now a thread-backtraces command in pumactl to print all active threads backtraces. This has been available via SIGINFO on Darwin, but now it works on Linux via this new command.
Puma.stats now has a requests_count counter.
lowlevel_error_handler got some enhancements - we also pass the status code to it now.
Phased restarts and worker timeouts should be faster.
Puma.stats_hash provides Puma statistics as a hash, rather than as JSON.

Loads of Bugfixes

The number of bugfixes in this release is pretty huge. Here’s the most important ones:

Shutdowns should be more reliable.
Issues surrounding socket closing on shutdown have been resolved.
Fixed some concurrency bugs in the Reactor.
out_of_band should be much more reliable now.
Fixed an issue users were seeing with ActionCable and not being able to start a server.
Many stability improvements to prune_bundler.

Nicer Internals and Tests

This release has seen a massive improvement to our test coverage. We’ve pretty much doubled the size of the test suite since 4.0, and it’s way more stable and reproducible now too.

A number of breaking changes come with this major release. For the complete list, see the HISTORY file.

Thanks to Our Contributors!

This release is our first major or minor release with new maintainer MSP-Greg on the team. Greg has been doing tons of work on the test suite to make it more reliable, as well as a lot of work on our SSL features to bring them up-to-date and more extendable. Greg is also our main Windows expert.

The following people contributed more than 10 commits to this release:

Tim Morgan
Vyacheslav Alexeev
Will Jordan
Jeff Levin
Patrik Ragnarsson, who’s also been very helpful in our Issues tracker.

If you’ve like to make a contribution to Puma, please see our Contributors Guide. We’re always looking for more help and try to make it as easy as possible to contribute.

Enjoy Puma 5!

Want a faster website?

I'm Nate Berkopec (@nateberkopec). I write online about web performance from a full-stack developer's perspective. I primarily write about frontend performance and Ruby backends. If you liked this article and want to hear about the next one, click below. I don't spam - you'll receive about 1 email per week. It's all low-key, straight from me.

Products from Speedshop

The Complete Guide to Rails Performance is a full-stack performance book that gives you the tools to make Ruby on Rails applications faster, more scalable, and simpler to maintain.

Learn more

The Rails Performance Workshop is the big brother to my book. Learn step-by-step how to make your Rails app as fast as possible through a comprehensive video and hands-on workshop. Available for individuals, groups and large teams.

Learn more

Announcing the Rails Performance Apocrypha

I've written a new book, compiled from 4 years of my email newsletter.

The Practical Effects of the GVL on Scaling in Ruby

MRI Ruby's Global VM Lock: frequently mislabeled, misunderstood and maligned. Does the GVL mean that Ruby has no concurrency story or CaN'T sCaLe? To understand completely, we have to dig through Ruby's Virtual Machine, queueing theory and Amdahl's Law. Sounds simple, right?

The World Follows Power Laws: Why Premature Optimization is Bad

Programmers vaguely realize that 'premature optimization is bad'. But what is premature optimization? I'll argue that any optimization that does not come from observed measurement, usually in production, is premature, and that this fact stems from natural facts about our world. By applying an empirical mindset to performance, we can...

Why Your Rails App is Slow: Lessons Learned from 3000+ Hours of Teaching

I've taught over 200 people at live workshops, worked with dozens of clients, and thousands of readers to make their Rails apps faster. What have I learned about performance work and Rails in the process? What makes apps slow? How do we make them faster?

3 ActiveRecord Mistakes That Slow Down Rails Apps: Count, Where and Present

Many Rails developers don't understand what causes ActiveRecord to actually execute a SQL query. Let's look at three common cases: misuse of the count method, using where to select subsets, and the present? predicate. You may be causing extra queries and N+1s through the abuse of these three methods.

The Complete Guide to Rails Performance, Version 2

I've completed the 'second edition' of my course, the CGRP. What's changed since I released the course two years ago? Where do I see Rails going in the future?

A New Ruby Application Server: NGINX Unit

NGINX Inc. has just released Ruby support for their new multi-language application server, NGINX Unit. What does this mean for Ruby web applications? Should you be paying attention to NGINX Unit?

Malloc Can Double Multi-threaded Ruby Program Memory Usage

Memory fragmentation is difficult to measure and diagnose, but it can also sometimes be very easy to fix. Let's look at one source of memory fragmentation in multi-threaded CRuby programs: malloc's per-thread memory arenas.

Configuring Puma, Unicorn and Passenger for Maximum Efficiency

Application server configuration can make a major impact on the throughput and performance-per-dollar of your Ruby web application. Let's talk about the most important settings.

Is Ruby Too Slow For Web-Scale?

Choosing a new web framework or programming language for the web and wondering which to pick? Should performance enter your decision, or not?

Railsconf 2017: The Performance Update

Did you miss Railsconf 2017? Or maybe you went, but wonder if you missed something on the performance front? Let me fill you in!

Understanding Ruby GC through GC.stat

Have you ever wondered how the heck Ruby's GC works? Let's see what we can learn by reading some of the statistics it provides us in the GC.stat hash.

Rubyconf 2016: The Performance Update

What happened at RubyConf 2016 this year? A heck of a lot of stuff related to Ruby performance, that's what.

What HTTP/2 Means for Ruby Developers

Full HTTP/2 support for Ruby web frameworks is a long way off - but that doesn't mean you can't benefit from HTTP/2 today!

How Changing WebFonts Made Rubygems.org 10x Faster

WebFonts are awesome and here to stay. However, if used improperly, they can also impose a huge performance penalty. In this post, I explain how Rubygems.org painted 10x faster just by making a few changes to its WebFonts.

Page Weight Doesn't Matter

The total size of a webpage, measured in bytes, has little to do with its load time. Instead, increase network utilization: make your site preloader-friendly, minimize parser blocking, and start downloading resources ASAP with Resource Hints.

Hacking Your Webpage's Head Tags for Speed and Profit

One of the most important parts of any webpage's performance is the content and organization of the head element. We'll take a deep dive on some easy optimizations that can be applied to any site.

How to Measure Ruby App Performance with New Relic

New Relic is a great tool for getting the overview of the performance bottlenecks of a Ruby application. But it's pretty extensive - where do you start? What's the most important part to pay attention to?

Ludicrously Fast Page Loads - A Guide for Full-Stack Devs

Your website is slow, but the backend is fast. How do you diagnose performance issues on the frontend of your site? We'll discuss everything involved in constructing a webpage and how to profile it at sub-millisecond resolution with Chrome Timeline, Google's flamegraph-for-the-browser.

Get notified on new posts.

Straight from the author. No spam, no bullshit. Frequent email-only content.