Packagecloud logo

Micro-optimizations matter: preventing 20 million system calls

TL;DR

This blog post is a followup on our previous post How setting the TZ environment variable avoids thousands of system calls. In this post, we’ll explore a particularly prominent case where a micro-optimization (like removing a system call in a hot path) had a drastic effect on software performance.

What is a “noticeable” improvement?

We previously described an environment variable that an application can set to avoid thousands of extra system calls.

Our post was met with some reasonable questions and a bit of skepticism:

  • “This is neat, but does removing a single system call result in any noticeable performance improvement for a program?”
  • “This seems unnecessary; system calls are fast on Linux, anyway.”

It’s difficult to say what “noticeable” might mean for each developer and their application. Kernel and driver developers often devote a considerable amount of time to micro-optimize their code and data structures in such a way as to maximize usage of CPU cache and reduce the amount of CPU usage, even by what most programmers might consider tiny amounts. Should application programmers pay less mind to these sorts of optimizations? After all, one might consider these micro-optimizations as not being “noticeable” either.

For the purposes of this article, let’s take “noticeable” to mean “easily measurable and very obvious.” Can we show a real world example of when removing a slow (non-vDSO enabled) system call from a code path was easily measurable and very obvious?

It turns out that there are infact many instances of this in the wild: everything from packet sniffers to programming language runtimes. Let’s look at a particularly infamous case: sigprocmask and its effect on the Ruby language runtime.

 

What is sigprocmask?

sigprocmask is system call that is used to check or set the current process’ signal mask. This allows a program to block or permit signals and is useful when a program needs to execute a critical section of code that cannot be interrupted.

It is not a particularly complex system call. The kernel code for sigprocmask shows that it writes a sigset_t to a C structure that contains the current process’ state (called a task_struct in the kernel); this is a very fast operation.

Let’s create a simple test program that calls sigprocmask in a tight loop and use strace and time to measure it:

#include <stdlib.h>
#include <signal.h>

int
main(int argc, char *argv[])
{
	int i = 0;
	sigset_t test;

	for (; i < 1000000; i++) {
		sigprocmask(SIG_SETMASK, NULL, &test);
	}

	return 0;
}

Compile with gcc -o test test.c and execute first with time and then again with strace and time.

On my test system:

  • Running simply with time ./test, yields on the order of 0m0.047s, 0m0.012s, and 0m0.036s real, user, and system time, respectively.
  • Running again with time stace -ttT ./test, shows overhead imposed by strace quite clearly with 0m52.364s, 0m9.313s, and 0m14.349s real, user, and system time, respectively.

The strace output will show approximate execution times for each sigprocmask call (which will likely show up as rt_sigprocmask on your system). The approximate times are very small. On my test system, I get something like 0.000003 most of the time, with an occasional jump to 0.000074.

(By the way: it turns out that measuring exact system call execution time is difficult for a number of reasons all of which are far outside the scope of this article. We can assume, though, that all of these measurements are equally incorrect.)

Let’s review what we know:

  • sigprocmask is a system call used to set or check the current process’ signal mask.
  • The kernel side code is simple and should execute quickly.
  • The time and strace output both show that executing 1 million sigprocmask calls in a tight loop terminates quickly with each call taking a very small amount of time.

With numbers like this how could removing extra calls to sigprocmask ever be worth the effort?

 

Look deeper

Sometimes, other code we use in our applications (like system libraries, the kernel, glibc, etc) does things that we don’t quite expect or has side effects that don’t seem obvious up front. To illustrate this point, I’m going to show a severe performance penalty that was incurred by indirectly using sigprocmask in a test program and then show how this manifested in a real world application.

One of the most prominent examples of extra sigprocmask calls leading directly to an obvious and easily measurable performance degradation was in our old friend Ruby 1.8.7. The performance loss occurs when Ruby 1.8.7 is compiled with one particular configure flag.

We’ll start by using Ruby 1.8.7 with the default configure flags used on most operation systems (Debian, Ubuntu, et. al.) at the time Ruby 1.8.7 was widely used. We’ll examine a test case illustrating the performance loss followed by a real world example and an explanation of how this happened in Ruby.

 

sigprocmask test case

Let’s take a look at the following test code:

def make_thread
  Thread.new do
    a = []
    10_000_000.times do
      a << "a"
      a.pop
    end
  end
end

t = make_thread
t1 = make_thread

t.join
t1.join

This code is simple: it creates two threads, each of which adds and removes data to an array 10 million times. If we run this with the default Ruby 1.8.7 configure flags and strace, we’ll see some surprising results:

$ strace -ce rt_sigprocmask /tmp/test-ruby/usr/local/bin/ruby /tmp/test.rb
Process 30018 attached
Process 30018 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  0.50    0.139288           0  20033025           rt_sigprocmask

Over 20 million system calls for sigprocmask were generated by the Ruby VM from the code I pasted above. You may think: “But, look at how little time was consumed! There’s no way this matters!”

As mentioned before, measuring system call time is a bit complicated. Let’s re-run the test program with time (but not strace) and get a reading on how long this test program takes to complete on my system:

$ time /tmp/gogo/usr/local/bin/ruby /tmp/test.rb

real        0m6.147s
user        0m5.644s
sys         0m0.476s

So, about 6 seconds real time to execute. This means the code is making roughly 3.3 million sigprocmask calls per second. Wow.

If we adjust one configure flag for building Ruby slightly, we can get the Ruby VM build system to avoid making calls to sigprocmask!

Let’s re-run the strace and time tests, but with a Ruby that has been adjusted slightly to avoid making sigprocmask calls and see what we get:

$ strace -ce rt_sigprocmask /tmp/test-ruby-2/usr/local/bin/ruby /tmp/test.rb
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
  -nan    0.000000           0         3           rt_sigprocmask

Whoa! We’ve reduced the calls to sigprocmask from over 20 million to 3. Looks like strace had trouble computing the system call execution time :(

Let’s see what time says:

$ time /tmp/test-ruby-2/usr/local/bin/ruby /tmp/test.rb

real	0m3.716s
user	0m3.692s
sys   0m0.004s

Roughly 40% faster (real time) than our previous example and less than 1 sigprocmask call per second.

So, that’s a cool result, but it raises a few important questions:

  • Isn’t this example a bit contrived?
  • When would this actually matter in the real world?
  • What is going on when adjusting the configure flags that is actually causing this?

Let’s look at a real world example of when this mattered and then we’ll dive into some of the details of what is actually going on.

 

Real world example: Puppet

A bug filed against Puppet shows precisely the impact of extra sigprocmask system calls in the Ruby VM:

I’m currently facing what I consider serious performance issues. In fact, Puppet seems to be very slow in general.

Let me show you a first example:

$ time puppet —version 0.24.5

real 0m0.718s user 0m0.576s sys 0m0.140s

During that time, there is hundreds, if not thousands “rt_sigprocmask(SIG_BLOCK, NULL, [], 8)” calls being made. All this to show the version.

If you follow the thread, you can find more comments complaining about the poor performance of Puppet and further seaches show stackoverflow questions expressing the same concern.

It’s not just Ruby and Puppet, though. You can find similar bug reports for other projects with users reporting 100% CPU usage and hundreds of thousands of calls to sigprocmask melting their CPUs.

 

How does this happen? Shouldn’t this be easy to fix?

What’s really going on here is that the calls to sigprocmask are coming from inside of a pair of glibc functions (that are not system calls): getcontext and setcontext.

These two functions are used to save and restore the CPU state and are commonly used by programs and libraries implementing exception handling or userland threads. In the case of Ruby 1.8.7: setcontext and getcontext are used as part of the userland thread implementation to context switch between threads.

You’d imagine that this pair of functions would execute quickly, after all, they just need to save or restore a small set of CPU registers. It turns out that in addition to saving the set of CPU registers (a very fast operation) the implementations of these functions in glibc are such that the sigprocmask system call is used to save or restore the signal mask.

Recall that Linux provides a mechanism (the vDSO) for certain system calls to be executed in userland instead of the kernerl in order to reduce the cost associated with making those system calls. Unfortunately, sigprocmask is not one of the system calls that is provided by the vDSO. All sigprocmask system calls will cause a transition from userland to the kernel.

The cost of this transition is much higher than the cost of other operations in setcontext or getcontext (which are a few simple memory writes). If you are calling these functions very often you are now performing a slow operation (in this case a system call to sigprocmask, which does not pass through the vDSO) every time you intended to do something quickly (like saving or restoring the CPU register set to switch threads).

 

Why does changing a configure flag make this go away?

At the time Ruby 1.8.7 was widely used, it was by default configured with --enable-pthread which enabled a separate OS-level timer thread that would occasionally run to pre-empt the Ruby VM. This is how the Ruby VM knew when it was time to switch between the userland threads that were mapped to threads created in Ruby programs. Using --enable-pthread also caused the configure script to search for and use the functions getcontext and setcontext. If you didn’t use --enable-pthread, the configure script would instead search for and use _setjmp and _longjmp (note the underscores) instead. These functions do not save or restore the signal mask, and thus, thus they do not make the system call to sigprocmask.

So:

  • --enable-pthread and --disable-pthread were intended to be used to control whether the Ruby VM’s userland threading implementation would use an OS level thread or a simple VTALRM signal to notify the Ruby VM when it was time to switch execution to a different Ruby thread.
  • The unintended side effect of switching between these two pre-emptiong methods is that the underlying primitives used to implement context switching are swapped between setcontext/getcontext and _setjmp/_longjmp.
  • The additional side effect of this low level function swap is either additional calls to sigprocmask (if the setcontext/getcontext pair is chosen) or not.
  • The additional side effect being reduced performance if the pair which uses sigprocmask is chosen.
  • The real world result in which users of Puppet (and other software) have a poor experience.

Phew; all of this from one configure flag that essentially enables or disables a single system call in a hot path.

Conclusion

Micro-optimizations do matter, but the extent to which they matter is, of course, application specific. Being aware that the libraries and code you depend on can do things you might not expect (like make a slow system call in a hot path) and knowing which tools to use to detect and correct these sorts of issues can have a vast impact on your users and, in some cases, your users’ users.

You might also like other posts...