Faster Ruby: Thoughts from the Outside

(This is Part II of the Faster Ruby posts, which started with a retrospective on Ruby+OMR, a Ruby JIT compiler I worked on five years ago)

As someone who comes from a compiler background, when asked to make a language fast, I’m sympathetic to the reaction: “Just throw a compiler at it!”. However, working on SpiderMonkey, I’ve come to the conclusion that a fast language implementation has many moving parts, and a compiler is just one of them.

I’m going to get to Ruby, but before I get there, I want to take a brief tour of some of the bits and pieces of SpiderMonkey that help make it a fast JavaScript engine; from that story, you may be able to intuit some of my suggestions for how Ruby ought to move forward!

Good Bones in a Runtime

It’s taken me many years of working on SpiderMonkey to internalize some of the implications of various design choices, and how they drive good performance. For example, let’s discuss the object model:

In SpiderMonkey, a JavaScript object consists of two pieces: a set of slots, which store values, and a shape, which describes the layout of the object (that is, which property ends up in which slot).

Shapes are shared across many objects with the same layout:

var a = [];
for (var i = 0; i < 1000; i++) {
    var o = {a: 1, b: 2};
    a.push(o);
}

In the above example, there are a thousand objects in the array, but all those objects share the same shape.

Recall as well that JavaScript is a prototype-based language: each object has a prototype. So there’s a design decision to make: for a given object, where do you store the prototype?

It could well be in a slot on the object, but that would bloat every object. Just as layouts are shared across many different objects, many objects share a prototype: in the above example, every object in the array has Object.prototype as its prototype. We therefore associate the prototype of an object not with the object itself, but with the shape of the object. This means that when you mutate the prototype of an object (Object.setPrototypeOf), we have to change the shape of the object.
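
To make this concrete, here is a minimal sketch of the design in Ruby (since this post is ultimately about Ruby); all the class and method names here are mine, purely illustrative, and not how SpiderMonkey actually spells any of this:

# Hypothetical sketch of the shape/slots split. A Shape maps property
# names to slot indices and also owns the prototype, so all objects
# with the same layout and prototype can share one Shape.
Shape = Struct.new(:slot_map, :prototype) do
  def slot_of(name)
    slot_map[name]
  end

  # Mutating the prototype means moving to a *different* shape,
  # which is why Object.setPrototypeOf forces a shape change.
  def with_prototype(proto)
    Shape.new(slot_map, proto)
  end
end

JSObject = Struct.new(:shape, :slots) do
  def get(name)
    slots[shape.slot_of(name)]
  end
end

# Every {a: 1, b: 2} object shares one shape; only the slots differ.
shape_ab = Shape.new({ "a" => 0, "b" => 1 }, :object_prototype)
objects = Array.new(1000) { JSObject.new(shape_ab, [1, 2]) }
objects.all? { |o| o.shape.equal?(shape_ab) } # => true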

Given that all property lookup is based on either the properties of an object or its prototype chain, we now have an excellent key on which to build a cache for property access. In SpiderMonkey, these inline caches are associated with property-access bytecodes; each stub in the inline cache chain for a bytecode doing a property load like o.b ends up looking like this:

if (!o.hasShape(X)) { try next stub; }  // guard: does o still have shape X?
return o.slots(X.slotOf('b'));          // fast path: load the value straight from its slot

Inline Caches are Stonkingly Effective

I’ve been yammering on about inline caches to pretty much anyone who will listen for years. Ever since I finally understood the power of SpiderMonkey’s CacheIR system, I’ve realized that inline caches are not just a powerful technique for making method dispatch fast, but they’re actually fundamental primitives for handling a dynamic language’s dynamism.

So let’s look briefly at the performance possibilities brought by Inline Caches:

Octane Scores (higher is better):

Configuration                             Score   vs. previous   vs. interpreter
Interpreter                                 739        -               1x
Interpreter + CacheIR                      4890       6.6x            6.6x
Interpreter + CacheIR + Baseline           9887       2.0x             13x
Interpreter + CacheIR + Baseline + Ion    34252       3.5x             46x

Now, let me first say outright: Octane is a bad benchmark suite, and not really representative of the real web… but it runs fast and produces good-enough results to share in a blog post (details here).

With that caveat, however, you can see the point of this section: well-designed inline caches can be STONKINGLY EFFECTIVE: adding our inline caches alone improves performance by more than 6x on this benchmark!

The really fascinating thing about inline caches, as they exist in SpiderMonkey, is that they serve to accelerate not just property accesses, but also most places where the dynamism of JavaScript rears its head. For example:

function add(a, b) { return a + b; }
var a = add(1, 1);     // Int32  + Int32  -> 2
var b = add(1, "1");   // Int32  + String -> "11"
var c = add("1", "1"); // String + String -> "11"
var d = add(1.5, 1);   // Double + Int32  -> 2.5

All these different cases have to be handled by the same bytecode op, Add.

loc     op
-----   --
main:
00000:  GetArg 0                        # a
00003:  GetArg 1                        # a b
00006:  Add                             # (a + b)
00007:  Return                          #

So, in order to make this fast, we add an inline cache to Add, to which we attach a list of type-specialized stubs. The first stub would be specialized to the Int32+Int32 case, the second to Int32+String, and so on and so forth.

Since types are typically stable at a particular bytecode op, this strategy is very effective at speeding up execution.
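
Here is a rough Ruby sketch of such a stub chain; the closure-based representation is mine and purely illustrative (real stubs are generated code, not closures):

# Each stub pairs a type guard with a fast path specialized to the
# types this call site has actually seen; a miss falls through to the
# next stub, and finally to a generic slow path.
Stub = Struct.new(:guard, :fast_path)

add_ic = [
  # specialized to the Int32+Int32 case
  Stub.new(->(a, b) { a.is_a?(Integer) && b.is_a?(Integer) },
           ->(a, b) { a + b }),
  # specialized to the Int32+String case (JS semantics: concatenate)
  Stub.new(->(a, b) { a.is_a?(Integer) && b.is_a?(String) },
           ->(a, b) { a.to_s + b }),
]

def generic_add(a, b)
  # Stand-in for the fully generic Add semantics; a real IC would
  # also attach a newly specialized stub to the chain here.
  a + b
end

def dispatch(stubs, a, b)
  stubs.each do |stub|
    return stub.fast_path.call(a, b) if stub.guard.call(a, b)
  end
  generic_add(a, b) # slow path
end

dispatch(add_ic, 1, 1)   # => 2, via the Int32+Int32 stub
dispatch(add_ic, 1, "1") # => "11", via the Int32+String stub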

Making Ruby Fast: Key Pieces

Given the above story, you will be unsurprised to hear that I would suggest starting by improving the Ruby object model: give Ruby objects shapes.

The good news for Ruby is that there are people at Shopify pushing this exact thing. Chris Seaton’s blog post, The Future Shape of Ruby Objects, is a far more comprehensive and Ruby-focused introduction to shapes than what I wrote above, and the bug tracking the work is here.

The second thing I’d do is invest in just enough JIT compilation technology to allow the creation of easy-to-reason-about inline caches. Because I come from SpiderMonkey, I would almost certainly shamelessly steal the general shape of CacheIR, as I think Jan de Mooij has really hit on something special with its design. This would likely be a very simple template JIT, done method-at-a-time.
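
To illustrate the kind of cache I mean, here is a hypothetical sketch of a per-call-site inline cache for Ruby method dispatch; everything here is invented for illustration, and in a real template JIT each stub would be a few machine instructions rather than interpreted Ruby:

# Hypothetical per-call-site inline cache for Ruby method dispatch.
# Each stub guards on the receiver's class and jumps straight to the
# cached method, skipping the full lookup.
CallStub = Struct.new(:klass, :method)

class CallSiteIC
  def initialize(name)
    @name = name
    @stubs = []
  end

  def call(recv, *args)
    @stubs.each do |stub|
      if recv.instance_of?(stub.klass)             # guard
        return stub.method.bind_call(recv, *args)  # fast path
      end
    end
    # Slow path: full method lookup, then cache the result.
    meth = recv.class.instance_method(@name)
    @stubs << CallStub.new(recv.class, meth)
    meth.bind_call(recv, *args)
  end
end

ic = CallSiteIC.new(:length)
ic.call("hello") # miss: full lookup, attaches a stub for String
ic.call("world") # hit: guarded fast path, no lookup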

When I worked on Ruby+OMR, I didn’t have a good answer for how to handle the dynamism of Ruby, due to a lack of practical experience. There was a fair amount of hope that we could recycle the JIT profiling architecture from J9: accumulate data from counters injected into a basic compilation of a method, and feed it into a higher-optimizing recompilation that would specialize further. It’s quite possible that this could work! However, having seen the inline caching architecture of SpiderMonkey, I realize now that JIT profiling would have been maybe the costliest way we could have generated the data we needed for type specialization. I may well have read this paper at the time, but I don’t think I understood it.

Today in SpiderMonkey, all our type profiling is done through our inline caches. Our top-tier compiler frontend, WarpBuilder, analyzes the inline cache chains to determine which cases are actually important and worth speculating on. We even use a neat trick with ICs to power smart inlining. Today, the thing I wish a project like OMR would provide most is the building blocks for a powerful inline cache system.
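
The rough idea, sketched with hypothetical structures (this is the spirit of how WarpBuilder consumes IC chains, not its actual code): the stubs an IC has accumulated are themselves the type profile, so no separate counters are needed.

# If a call site's IC chain is short and stable, that chain tells the
# optimizing tier exactly what to speculate on.
def speculation_for(ic_stubs)
  case ic_stubs.length
  when 0 then :unreached      # never ran; compile a bailout
  when 1 then ic_stubs.first  # monomorphic: guard on this case and inline it
  else :generic               # polymorphic: fall back to the generic op
  end
end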

Out in the real world, YJIT is a really interesting JIT for Ruby, built around the fascinating basic block versioning (BBV) architecture that Maxime Chevalier-Boisvert developed during her PhD, an architecture that I, and others who have worked on SpiderMonkey, have long admired as innovative. As I understand it, YJIT doesn’t need to lean on inline caching nearly as much as SpiderMonkey does, since much of the same optimization falls naturally out of the versioning of basic blocks. Still, in her blog post introducing YJIT, Maxime does say that even YJIT would benefit from shapes, something I can definitely believe.
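
For the curious, the core trick of BBV, sketched extremely roughly (hypothetical structures; Maxime’s thesis is the real reference):

# Each basic block may be compiled several times, once per incoming
# type context, so a type check done early in one block never has to
# be repeated in its successors.
class BBV
  def initialize
    @versions = {} # (block, type context) -> compiled code
  end

  def version_of(block, context)
    @versions[[block, context]] ||= compile(block, context)
  end

  def compile(block, context)
    # With context = { "a" => Integer }, guards on `a` can be dropped,
    # and successor blocks are requested with the context extended
    # rather than with the types forgotten.
    "code for #{block} specialized to #{context}"
  end
end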

Open Questions, to me at least

  • Does Ruby in 2022 still have problems with C-extensions? Once upon a time we were really concerned about how opaque C-extensions were. TruffleRuby used the really neat Sulong technology to solve this.

    Does the C-extension API need to be improved to allow a fast implementation to exist? Unsure.

SpiderMonkey has the advantage of mostly working in a ‘closed world’, where native-code integrations are fairly tightly coupled. This doesn’t describe Ruby gems that lean on the C APIs.

  • What kind of speedup is available for big Rails applications? If 90% of the time in an application is spent in database calls, then there’s little opportunity for improvement via JIT technologies.

Conclusion

I’ve been out of the Ruby game for a long time. Despite that, I find myself thinking back to it frequently. Ruby+OMR was, in hindsight, perhaps a poor architectural fit. Java is dynamic, but languages like JavaScript and Ruby put appreciably different pressure on compilation technology.

With the lessons I’ve learned, it seems to me that a pretty fast Ruby is probably possible. JavaScript is a pretty terrible language to make fast, and yet it has been made fast (a huge performance war between implementations, which caused huge corporations to pour resources into JS performance, certainly helped… maybe Ruby needs a performance war). I’m glad to see the efforts coming out of Shopify, and I really think they’ll pay huge dividends over the next few years. I wish that team much luck.

(There’s some really excellent discussion about this post over at Hacker News)