Deoptimization Engine #1419

Closed
wants to merge 23 commits into from

Conversation

shyouhei
Member

Abstract

Implemented a way to optimize ruby's executions and to revert those optimizations. The strategy is restricted so that no VM states, such as program counter(s), are affected by the modifications. This restriction makes deoptimization much more lightweight. Experiments show that optimizations on this mechanism speed up micro-benchmarks, yet add no significant overhead to pitfall-ish activities like eval.

Introduction

Ruby is said to be slow[1]. While there can be arguments about this, experiments show that it is at least not the fastest language to execute[2]. …

The reason behind this is variously attributed to its GC being slow, or its GVL, or its dynamic nature. But the author would like to point out the most important thing that Ruby lacks: it does not even try to optimize its execution. We believe this is the root cause of the slowness we suffer.

We wrote "try to" above. This is the essential concept of this pull request. The reason Ruby has not optimized so far is that its execution might differ on occasion. For instance a 1 + 2 might not always be 3, because + can be redefined on-the-fly. So the interpreter does not constant-fold the expression and looks up Integer#+ every time.

Such redefinition, however, rarely happens. Almost every time 1 + 2 is evaluated, its value is arguably 3. Given this empirical assessment, it makes sense for Ruby to first try 3, and if Integer#+ is redefined for any reason, fall back to the lookup. This should be much faster for the vast majority of normal (not redefined) cases, while redefinition still works, at the cost of some overhead.

This sort of technique is called deoptimization[3]. Deoptimization is done, more or less, in other languages and implementations. Notably, JVM-related projects have much research in this area. For instance JRuby+Truffle already has a rich application of this technique[4].

What is pull-requested here is not a direct C translation of JRuby+Truffle's approach. Let me describe what is going on.

The deoptimization strategy

Since 1.9, Ruby's interpreter has been equipped with a language VM (visible as ::RubyVM from inside ruby scripts). It first generates sequences ("iseq"s hereafter) of VM instructions ("insn"s hereafter), then executes them in series. The key structure of this mechanism is called struct rb_iseq_constant_body.

struct rb_iseq_constant_body {
    (…snip…)

    unsigned int iseq_size;
    const VALUE *iseq_encoded; /* encoded iseq (insn addr and operands) */

    (…snip…)
};

The iseq_encoded field is what the VM scans to execute. This is where we want to optimize. Prior to making any modifications, we hold a backup of it using memcpy. In case of a deoptimization event, we just overwrite this memory region with the backed-up sequence, again using memcpy, to revert whatever optimizations might have been applied to it.
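
As a minimal sketch of the idea (the deoptimize field and the function names here are hypothetical placeholders; the actual patch implements this in deoptimize.c / deoptimize.h):

```
#include <stdlib.h>
#include <string.h>

/* Assumed: struct rb_iseq_constant_body grows a field
 *     const VALUE *deoptimize;    -- pristine copy of iseq_encoded */

static void
iseq_prepare_deoptimization(rb_iseq_t *iseq)
{
    struct rb_iseq_constant_body *body = iseq->body;
    size_t size = sizeof(VALUE) * body->iseq_size;
    VALUE *backup = (VALUE *)malloc(size);

    memcpy(backup, body->iseq_encoded, size); /* save the pristine sequence */
    body->deoptimize = backup;
}

static void
iseq_deoptimize(rb_iseq_t *iseq)
{
    struct rb_iseq_constant_body *body = iseq->body;

    /* revert in place; no program counter or other VM state is touched */
    memcpy((VALUE *)body->iseq_encoded, body->deoptimize,
           sizeof(VALUE) * body->iseq_size);
}
```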

The upside of this strategy is that it is fairly simple. Because everything is written in pure C we have no portability concerns. It is also so simple that the deoptimization does not involve the VM states at all, including program counters. This is an advantage when execution of instructions recurs; if we had to update program counters, we would have to scan the whole VM stack to keep consistency between before and after the deoptimization process.

The downside of this strategy is that it restricts what can be optimized. For instance we cannot reorder instructions, because exceptions might be raised inside a reordered sequence. We cannot change the length of an iseq, because the underlying memory region cannot be reallocated.

However, even with these restrictions, we can do something.

Implemented optimizations

We have the following optimizations in this pull request.

Constant folding

For instance 1 + 2 is optimized into 3 like this.

--- /dev/shm/xj79gt      2016-08-17 17:09:31.000000000 +0900
+++ /dev/shm/1gaaeo      2016-08-17 17:09:31.000000000 +0900
@@ -20,8 +20,10 @@ local table (size: 2, argc: 0 [opts: 0,
 |------------------------------------------------------------------------
 0000 trace            256                                             (   7)
 0002 trace            1                                               (   8)
-0004 putobject_OP_INT2FIX_O_1_C_
-0005 putobject        2
-0007 opt_plus         <callinfo!mid:+, argc:1, ARGS_SIMPLE>, <callcache>
+0004 putobject        3
+0006 nop
+0007 nop
+0008 nop
+0009 nop
 0010 trace            512                                             (   9)
 0012 leave                                                            (   8)

The nops are introduced to meet the "no PC modification" restriction. In order to implement this, the opt_plus instruction is modified to detect a putobject-putobject-opt_plus sequence. It then does the computation as usual, and overwrites itself with a sequence like the above diff.
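
A sketch of that rewrite (encoded_insn() is a hypothetical helper returning the encoded form of an insn; with direct threading these are label addresses rather than enum values). The slot layout matches the diff above:

```
/* pc points at the 6-slot window
 *   putobject obj1 | putobject obj2 | opt_plus ci cc   (1+2+3 slots)
 * which is rewritten, with instruction boundaries preserved, into
 *   putobject sum | nop | nop | nop | nop              (2+1+1+1+1 slots) */
static void
constfold_plus(VALUE *pc, VALUE sum)
{
    int i;

    pc[0] = encoded_insn(BIN(putobject)); /* hypothetical encoder */
    pc[1] = sum;
    for (i = 2; i < 6; i++) {
        pc[i] = encoded_insn(BIN(nop));
    }
}
```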

Ruby-level constants (like ::Object) are also subject to folding. They are resolved and replaced with a putobject of the identical value.

send elimination

A typical ruby script consists of many method invocations. They are represented as send or its variant instructions in an iseq. If they are optimized, the effect should be far from marginal. The problem is that a method call is not always optimizable. We have to determine whether we can skip the call.

In order to do so, a method is called as usual the first time, and checked for whether it has any side effects. If a method does not contain any side effects, it (more precisely, its underlying iseq) is marked as being "pure". Next time the same method is called, we can consider eliminating the call to it.

However, a method that is not marked pure is not always ineligible for optimization. A method can of course call another method, and a method that is not marked pure may simply not have been checked yet. So a method has 3 states, namely "pure", "not pure", and "unpredictable". Every method starts its life as unpredictable. Once the methods it calls are all predicted to be pure, and it has no side effects itself, its state is updated to pure.

enum insn_purity {
    insn_is_pure          = Qtrue,  /* known to be free of side effects */
    insn_is_not_pure      = Qfalse, /* known to have side effects */
    insn_is_unpredictable = Qnil,   /* not analyzed yet, or depends on callees */
};

Side effects that we care about include the following (a sketch of such a classifier appears after the list):

  • Writes to variables other than local ones,
  • Definition of classes,
  • Calls to methods that are not pure,
  • Calls to C functions,
    • C functions cannot be inspected, so we conservatively treat them all as NG.
  • Yielding blocks.
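
As a hypothetical sketch of such a classifier (the function itself is an assumption about the real analyzer in optimize.c; BIN() and the insn names are CRuby's, and the default arm is oversimplified here):

```
static enum insn_purity
purity_of_insn(int insn)
{
    switch (insn) {
      case BIN(setglobal):            /* writes a non-local variable */
      case BIN(setinstancevariable):
      case BIN(setclassvariable):
      case BIN(setconstant):
      case BIN(defineclass):          /* defines a class */
      case BIN(invokeblock):          /* yields a block */
      case BIN(opt_call_c_function):  /* C functions cannot be inspected */
        return insn_is_not_pure;

      case BIN(send):                 /* purity depends on the callee */
      case BIN(opt_send_without_block):
      case BIN(invokesuper):
        return insn_is_unpredictable;

      default:                        /* putobject, getlocal, nop, ... */
        return insn_is_pure;
    }
}
```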

Once we know a method is pure, it is safe to eliminate a call to that method whose return value(s) are immediately discarded. Note however that even in such situations, evaluation of the method arguments is not always eligible for elimination; the arguments can have their own side effects. We just eliminate the send variant instruction, not its call-site setup, like the following diff:

--- /dev/shm/165rrgd     2016-08-17 10:44:10.000000000 +0900
+++ /dev/shm/jd0rcj      2016-08-17 10:44:10.000000000 +0900
@@ -23,8 +23,10 @@ local table (size: 2, argc: 0 [opts: 0,
 0004 putself
 0005 putself
 0006 opt_send_without_block <callinfo!mid:n, argc:0, FCALL|VCALL|ARGS_SIMPLE>, <callcache>
-0009 opt_send_without_block <callinfo!mid:m, argc:1, FCALL|ARGS_SIMPLE>, <callcache>
-0012 adjuststack      1
+0009 adjuststack      2
+0011 nop
+0012 nop
+0013 nop
 0014 trace            1                                               (  16)
 0016 putnil
 0017 trace            512                                             (  17)

This is how m(n()) is optimized, where method m is pure but n is not.

Elimination of variable assignments

As we saw, a method call cannot be eliminated when its return value is used. To relax this, we would like to skip assignments to variables where possible.

However, this is not easy to do precisely. A variable's liveness must be analyzed, and all types of variables except local ones are nearly impossible to analyze. Also, because the deoptimization process restricts what is allowed here, massive iseq rearrangements like SSA conversion are not doable.

In this pull request a very limited version of variable liveness analysis is implemented. Only local variables which are never referenced (assigned, then immediately made garbage) are detected, then eliminated if possible. This is not the only thing that could be done theoretically, but it is practically the easiest to detect, so we do this as a bridgehead to further optimizations.

Note that we have to check iseqs recursively, because local variables are reachable from inside a block, and a block is represented as a separate iseq. This means we have to check all iseqs that can possibly reach a local variable to ensure no references to it exist.
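
A sketch of the check under these assumptions (rb_vm_insn_addr2insn() is assumed to decode a direct-threaded address back into an insn number; operand-unified getlocal variants and the recursion into child iseqs are elided):

```
static int
local_is_writeonly(const rb_iseq_t *iseq, VALUE idx)
{
    const struct rb_iseq_constant_body *body = iseq->body;
    unsigned int pc = 0;

    while (pc < body->iseq_size) {
        const VALUE *code = body->iseq_encoded;
        int insn = rb_vm_insn_addr2insn((const void *)code[pc]);

        if (insn == BIN(getlocal) && code[pc + 1] == idx) {
            return 0;           /* the slot is read somewhere */
        }
        pc += insn_len(insn);   /* skip over this insn's operands */
    }
    /* A complete check must also recurse into every child (block)
     * iseq that can see this local; a read from inside a block
     * counts as a reference. */
    return 1;
}
```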

After checking for usages, a useless assignment can be eliminated like this:

--- /dev/shm/1p1dusf     2016-08-25 18:06:58.000000000 +0900
+++ /dev/shm/1kpw0qe     2016-08-25 18:06:58.000000000 +0900
@@ -17,10 +17,14 @@ local table (size: 3, argc: 1 [opts: 0,
 [ 3] i<Arg>     [ 2] x
 0000 trace            256                                             (  12)
 0002 trace            1                                               (  13)
-0004 putself
-0005 getlocal_OP__WC__0 3
-0007 opt_send_without_block <callinfo!mid:m, argc:1, FCALL|ARGS_SIMPLE>, <callcache>
-0010 setlocal_OP__WC__0 2
+0004 nop
+0005 nop
+0006 nop
+0007 nop
+0008 nop
+0009 nop
+0010 nop
+0011 nop
 0012 trace            1                                               (  14)
 0014 putnil
 0015 trace            512                                             (  15)

Experiments

To determine the effectiveness of this approach we ran experiments in my environment. The machine is equipped with an Intel(R) Core(TM)2 Duo T7700 CPU and runs Ubuntu 16.04. The image below is a make benchmark result against ruby 2.4.0dev.

Looking at the benchmark, the results are (generally speaking) similar to 2.4, with several exceptional cases that speed up. The most improved, vm2_bigarray*, got a huge speedup because the generation of a big array is eliminated entirely. The other improved benchmarks show that the optimizations described above work quite well.

It is worth noting that the vm2_eval* result is 0.933, i.e. eval is only 2.7% slower compared to no optimization. This can be considered the deoptimization preparation cost, and it means the overhead is quite small.

Conclusions

Implemented a deoptimization engine for ruby and several optimizations on top of it. The proposed deoptimization strategy is lightweight, so that activities like eval work without huge overhead.

The author does not argue that this is the best optimization technique we know of. Rather, even if it is suboptimal, the benchmark results show that the optimizations proposed here gain a considerable amount of speedup. This is why ruby is said to be slow: it is not optimized at all, so even trivial optimizations significantly impact it.

We believe ruby should have optimizations implemented.

Future works

More optimizations can be built on top of this deoptimization engine. For instance, branches with constant expressions could be relaxed to unconditional jumps. More analysis of variable liveness could eliminate more assignments. Going in the other direction, we could think of adding new local variables, for example to eliminate common subexpressions.

The deoptimization strategy itself is subject to improvement. If we can tolerate more slowdown of eval, a more fine-grained setup like three-address code is conceivable. To do so, the deoptimization engine would have to be able to revert the IR conversion.

To suppress warnings, add an attribute if possible.

	* template/insns_info.inc.tmpl (insn_len): now that this file is
	  included from more than one file, chances are that these static
	  functions can be left unused depending on the compilation unit.
	  In such situations, in order to prevent compiler warnings, we
	  annotate this function as possibly unused.

	* template/insns_info.inc.tmpl (insn_name): ditto.

	* template/insns_info.inc.tmpl (insn_op_type): ditto.

	* template/insns_info.inc.tmpl (insn_op_types): ditto.

	* configure.in (MAYBE_UNUSED): check if unused attribute is
	  supported by the compiler.

This is important because signed operations shall not overflow in C.
If you want roundtrip behavior you can just cast to unsigned then cast
back, but if you want to avoid overflows completely, I believe this is
the fastest method.

	* internal.h (SATURATION_ADD): new macro that does addition of two
	  variables in saturated integer arithmetic.

	* configure.in: check for __builtin_add_overflow compiler
	  intrinsic which is available in recent gcc and clang, with
	  expectedly-optimal implementation of overflow detection.
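
A sketch of what such a macro might look like (the exact definition in internal.h may differ, and the HAVE_BUILTIN_ADD_OVERFLOW guard name is an assumption about what configure generates; the fallback assumes non-negative operands):

```
#ifdef HAVE_BUILTIN_ADD_OVERFLOW
/* GCC/Clang intrinsic: reports overflow without ever performing the
 * overflowing signed addition itself. */
#define SATURATION_ADD(max, a, b) __extension__ ({ \
    __typeof__(a) sum_; \
    __builtin_add_overflow((a), (b), &sum_) ? (max) : sum_; \
})
#else
/* Portable fallback: compare before adding so the signed addition can
 * never overflow (assumes 0 <= (b) and (a) <= (max)). */
#define SATURATION_ADD(max, a, b) \
    (((a) > (max) - (b)) ? (max) : ((a) + (b)))
#endif
```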

This is handy when the current insn is auto-generated (e.g. operand
unification).  This by itself has no runtime overheads.

	* tool/instruction.rb (RubyVM::Instruction#sp_increase_macro_expr):
	  new method, to define a macro inside of each insn.

	* tool/instruction.rb (RubyVM::VmBodyGenerator#make_header_defines):
	  a bunch of new macros to define, which are (mostly) compile-time
	  constants that characterize many of the instructions, such as
	  their names.

	* tool/instruction.rb (RubyVM::VmBodyGenerator#make_footer_undefs):
	  clean up the new macros.

The struct rb_id_table can hold arbitrary VALUE values.  Now that this
struct is reachable from Ruby's object space, it must understand what
GC requests of it.

	* id_table.c (rb_id_table_mark): new function to avoid GC leak for
	  id tables.  This function applies against all implementations.

	* id_table.h (rb_id_table_mark): ditto.

This struct is an ID-to-VALUE mapping, which must be convertible to a
Hash.

	* id_table.h (rb_id_table_to_h): new function to convert a struct
	  rb_id_table to a Hash.  Expected to be used for inspection
	  purpose from inside a ruby script.

	* id_table.c (rb_id_table_to_h): ditto.

This implements arbitrary key-value pairs on each ISeq, meant to be
used when optimizing.  Much like C++11's generalized attributes.

Note that, as this is meant for optimization purposes, you can't
modify the attributes from a ruby script.  The ruby-level API is for
debugging purposes.

	* vm_core.h (rb_iseq_constant_body): new attributes field to hold
	  arbitrary key-value for any ISeq.

	* iseq.c (rb_iseq_annotate): ISeq attribute setter.  This is
	  intentionally not exposed to 3rd parties now, to prevent
	  potentially malicious usage of this feature.

	* iseq.c (rb_iseq_annotated_p): obtain specific attribute of an
	  ISeq.

	* iseq.c (iseqw_attributes): obtain attributes hash of an ISeq.

This variable is expected to be of an integer type which can be
incremented atomically.  It is expected to be used where a certain
object's "freshness" is vital, e.g. when invalidating a cache.

	* vm.c (ruby_vm_global_timestamp): new variable to hold per-VM
	  timestamp counter.

	* vm_insnhelper.h (INC_GLOBAL_TIMESTAMP): macro to increase timestamp.

	* vm_method.c (rb_clear_constant_cache): increase timestamp.

	* vm_method.c (rb_clear_method_cache_by_class): ditto.

	* vm.c (rb_next_class_serial): ditto.

This commit adds a new file named deoptimize.c, which basically saves
the yet-to-be-optimized instruction sequence aside, and reverts to
that state on occasion.

The "state" here means the pair of the ISeq's raw sequence and the
program counter.  Put differently, the "optimization" assumes that
everything other than those two is shared between the optimized and
non-optimized variants.  Examples of such shared info include the
catch table; so under this infrastructure you cannot yet optimize an
iseq which deals with exceptions.  That is to be developed later.

There is no corresponding on-the-fly optimizer yet; this commit only
adds overhead.  If you take a benchmark on this commit it should be
(hopefully slightly) slower than the original.  Any optimization that
utilizes this mechanism must be sufficiently faster than the overhead
introduced in this commit.

	* deoptimize.h: new file.

	* deoptimize.c: new file.

	* common.mk (COMMONOBJS): dependencies for new files.

	* vm_core.h (rb_iseq_constant_body): new field.

	* iseq.c (rb_iseq_free): proper handling of new field.

	* iseq.c (iseq_memsize): ditto.

	* vm_insnhelper.c (vm_check_iseq_freshness): new function to check
	  if given iseq is stale; if so, deoptimization is kicked.

	* vm_insnhelper.c (vm_push_frame): called from here right at the
	  beginning of an iseq execution.

	* vm_insnhelper.h (CALL_METHOD): also, a method call can result in
	  redefinition(s) of methods, like by requiring something.  At a
	  graceful return from a method we shall check freshness of
	  current iseq too.

	* vm.c (rb_vm_global_timestamp): export this variable.

	* iseq.c (rb_iseq_new_with_opt): prepare deoptimization.
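
Combined with the global timestamp above, the freshness check can be sketched like this (a minimal sketch; the optimized_at field name is an assumption):

```
/* Called when a frame is pushed and at graceful method return.  If
 * anything changed since this iseq was optimized (method/constant
 * cache invalidations bump ruby_vm_global_timestamp), throw the
 * optimized sequence away before executing it any further. */
static inline void
vm_check_iseq_freshness(rb_iseq_t *iseq)
{
    if (iseq->body->optimized_at != ruby_vm_global_timestamp) {
        iseq_deoptimize(iseq);  /* memcpy the pristine copy back */
        iseq->body->optimized_at = ruby_vm_global_timestamp;
    }
}
```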

This commit adds an on-the-fly ISeq analyzer.  It detects an ISeq's
purity, i.e. whether that ISeq has side effects or not.  Purity is
the key concept of optimization techniques in general, but in Ruby it
is even more important because there is a method called eval.  A pure
ISeq is free from eval, while those that are not pure are stuck in the
limbo where any of their side effects _could_ result in a (possibly
aliased) call to eval.  So optimizations tend not to be possible
against them.

Note however, that the analyzer cannot always statically say whether
the ISeq in question is pure or not.  It categorizes an ISeq into 3
states, namely pure, not pure, or "unpredictable".  The last category
is used when, for instance, there are branches yet to be analyzed, or
method calls to another unpredictable ISeq.

An ISeq's purity changes over time, not only by redefinition of
methods, but by other optimizations; for example, entering a
rarely-taken branch of a formerly-unpredictable ISeq kicks the
analyzer to fix its purity.  Such changes propagate to the callers.

	* optimize.c: new file.

	* optimize.h: new file.

	* common.mk (COMMONOBJS): dependencies for new files.

	* iseq.h (ISEQ_NEEDS_ANALYZE): new flag to denote the iseq in
	  question might need (re)analyzing.

Introduces a mechanism to replace a pair of send + pop instructions
with the strength-reduced equivalent of adjuststack + nop(s), when
that transformation is guaranteed possible.  When some method
redefinition happens to break that assumption, it falls back to the
original sequence.

	* insns.def (adjuststack): kind of "eats" the immediately preceding
	  instruction, moving itself into its place.  This is the key part
	  of the whole optimization we implement in this commit.

	* insns.def (send): now subject to elimination.  When there is a
	  sequence of "send then pop", and the send does not include any
	  side effects, we can safely ignore those two instructions.

	* insns.def (opt_send_without_block): ditto.

	* insns.def (invokesuper): ditto.

	* insns.def (putnil): this insn is already handled in
	  iseq_peephole_optimization() before execution.  But as the
	  optimization process goes on during runtime, additional places
	  where this insn can be considered may appear.  We would like to
	  wipe those out as well.

	* insns.def (putself): ditto.

	* insns.def (putobject): ditto.

	* insns.def (putstring): ditto.

	* insns.def (concatstrings): ditto.

	* insns.def (freezestring): ditto.

	* insns.def (toregexp): ditto.

	* insns.def (newarray): ditto.

	* insns.def (duparray): ditto.

	* insns.def (newhash): ditto.

	* insns.def (newrange): ditto.

	* compile.c (iseq_peephole_optimize): replace pop with
	  adjuststack 1; the possibility is that this instruction can "eat"
	  the immediately preceding instruction to increase its operand.

	* compile.c (REPLACE_ELEM): resurrect, because it now has a
	  client.

	* vm.c: add necessary header files.

	* common.mk (vm.$(OBJEXT)): update to follow new dependencies.

	* optimize.c (wipeout_pattern): pattern buffer, filled with nops.
	  Technically it is interesting because the pattern filled in is a
	  pointer to a function-static variable of another function; it
	  breaks variable visibility.

	* optimize.c (construct_pattern): constructor function to fill in
	  static pattern.

	* optimize.c (iseq_eliminate_insn): fill the pattern to patch the
	  memory region the program counter currently points to.  This is
	  where the actual "optimization" takes place.

	* vm_core.h (cfp_last_insn): latest instruction's place, length,
	  and stack increase.  Note that it does not hold a direct
	  pointer, because deoptimization can change the instruction
	  sequence between the creation of this field and its use.

	* vm_core.h (rb_call_cache): new field to hold the "temperature"
	  of the call site this cache represents.  This field is
	  explicitly declared signed because we need sub-zero temperatures.

	* vm_insnhelper.h (PREPARE_FOR_ELIMINATION): remember the
	  instruction currently running to CFP, for possible later
	  elimination.

	* vm_insnhelper.h (PREPARE_ELIMINATE_SENDISH): a variant, which
	  remembers the current instruction only when it is hot.

	* vm_insnhelper.c (vm_eliminate_insn): because this optimization
	  routine resides in a super duper hot path it is worth inlining
	  some checks before actual function call is issued.  This
	  function is placed here for that reason.

	* vm_insnhelper.c (vm_is_hot): check if it is safe to optimize the
	  given caller site.  Note that the safety is tested against what
	  is about to be called, not the currently running ISeq.

	* deoptimize.c (iseq_deoptimize): reset temperatures of each call
	  caches on deoptimization.

	* vm_insnhelper.h (inc_temperature): new macro to do a saturated ++

There are cases where a push push push ... pop pop pop sort of
sequence appears.  If the middle of such a sequence is optimized, a
nop and an adjuststack can become adjacent.  In such situations,
moving the adjuststack forward to leave the nop behind can trigger
further optimization.

	* optimize.c (iseq_move_nop):  swap "nop adjuststack" sequence
	  to make room for optimization.

	* vm_insnhelper.c (vm_move_nop): ditto.

	* insns.def (adjuststack): swap nop if possible.
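
A sketch of the swap (a nop occupies 1 slot, adjuststack 2 slots: the insn plus its operand; the encoded_insn() helper is again hypothetical):

```
/* Turn  [nop][adjuststack n]  into  [adjuststack n][nop]  without
 * changing the total length, so program counters stay valid. */
static void
move_nop(VALUE *pc)
{
    VALUE insn    = pc[1];  /* adjuststack, in encoded form */
    VALUE operand = pc[2];  /* its stack-adjust count n */

    pc[0] = insn;
    pc[1] = operand;
    pc[2] = encoded_insn(BIN(nop));  /* hypothetical encoder */
}
```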

This is a very limited version of variable liveness analysis.  Only
local variables which are never referenced (assigned, then immediately
made garbage) are detected, and eliminated.

One _could_ do more if the analysis happened much earlier than it does
now.  The current approach runs after a considerable amount of
execution has already been done using the compiled sequence.  But
analyzing earlier would impact the execution of evil activities like
eval.  That is not acceptable as of today, so what is possible is very
limited.

	* insns.def (setlocal): introduces the new concept of per-iseq
	  temperature.  If an ISeq seems to be executed many times (or for
	  a long time), trigger iseq analysis.

	* insns.def (getlocal): now subject to elimination.

	* insns.def (duparray): ditto.

	* insns.def (leave): analyze the leaving iseq.

	* optimize.c (iseq_eager_optimize): optimize an ISeq as a whole.
	  Some optimizations can be possible right after it was analyzed
	  and this function is here to do that.

	* optimize.c (iseq_analyze): now checks for local variable's
	  (very restricted) liveness.  Can be used for optimizations.

	* optimize.c (iseq_analyze_i): rename from iseq_purity, and
	  recursively checks for child ISeq.

	* optimize.h (iseq_local_variable_is_writeonly): utility function
	  to check if a local variable at given index is write only.

	* vm_core.h (rb_iseq_constant_body): new field to hold per-iseq
	  temperature.

	* iseq.h (ISEQ_EAGER_OPTIMIZED): new ISeq flag, denoting whether
	  the ISeq has experienced eager optimization.

	* deoptimize.c (iseq_deoptimize): reset new fields and flags on
	  deoptimize.

This is simply replacing an access to a constant with an immediate.
This has been possible before; the reason we didn't is that a constant
can (in spite of its constant-ness) be reassigned.  Now that we can
revert such optimizations, why not optimize.

	* insns.def (getinlinecache): the inline cache is already managing
	  constant validity, so let's just step forward.  Replace itself
	  with putobject.

	* optimize.c (iseq_const_fold):

	* optimize.c (iseq_squash): it is the third time I have written
	  this memcpy, so I think it's time to refactor it into a function.

	* optimize.c (iseq_move_nop): use iseq_squash().

	* optimize.c (iseq_eliminate_insn): ditto.

Integer binary operations are purely mathematical by default.  They
can be computed beforehand.  Here, for each such operation, we check
the immediately sibling instructions and, if it seems to be the case,
replace them with the computed value that sits on the stack top.

	* optimize.c (iseq_move_nop): generalized so that any instruction
	  can move around.

	* vm_insnhelper.c (vm_move_nop): ditto.

	* vm_insnhelper.h (MOVE_NOP): ditto.

	* insns.def (putobject): subject to move around.

	* insns.def (adjuststack): ditto.

	* vm_core.h (rb_control_frame_struct): new field counting the
	  number of continuous putobject instructions.  If this value is
	  sufficiently large, we can consider const-folding them.

	* tool/instruction.rb (RubyVM::VmBodyGenerator#make_footer):
	  set/reset the new field, depending on instructions.  This
	  assigned memory region is used much much later so the assignment
	  is mostly freely reorderable.

	* insns.def (opt_constfold): new (quasi-)instruction that is never
	  reached when not optimized.  Its purpose is to detect whether
	  constant folding is possible and, if so, replace the previous
	  instructions with a single putobject.

	* insns.def (opt_plus): subject to const fold.

	* insns.def (opt_minus): ditto.

	* insns.def (opt_mult): ditto.

	* insns.def (opt_div): ditto.

	* insns.def (opt_mod): ditto.

	* insns.def (opt_eq): ditto.

	* insns.def (opt_neq): ditto.

	* insns.def (opt_lt): ditto.

	* insns.def (opt_le): ditto.

	* insns.def (opt_gt): ditto.

	* insns.def (opt_ge): ditto.
@shyouhei changed the title from Deoptimization to Deoptimization Engine on Aug 26, 2016
@headius
Contributor

headius commented Aug 26, 2016

I read through this a bit and have a question: can it deopt during a method body? If I'm reading it right, the deopt can only occur immediately before entry into the method, which means off-thread changes won't be seen until you re-enter the optimized method again. Is that correct?

@shyouhei
Member Author

shyouhei commented Aug 27, 2016

@headius very good point. My answer is yes and no. In this patch deoptimization can occur at two points:

  • at the very beginning of a method, during construction of stack frame: 5231208#diff-347392b62c9edab3c8ab5fb0d71a012fR196
    • This is expected to purge a stale optimized sequence.
  • right after a method call returns. 5231208#diff-d52fa5b396928e0544140fd48f329543R126
    • Redefinitions ultimately happen inside some C function defined elsewhere. From a ruby script's point of view, calling a C function is done via calling a method. So every time we call a method, there is a chance of redefinition. We have to make sure.

In short, inter-thread method tampering is not checked explicitly. There are chances for deoptimization during a method's execution, but not immediately when another thread changes something.

I think we can detect off-thread changes. For now we have the GVL, so luckily we only need to check right after we acquire the GVL. That should suffice for now. If we decide to give up the GVL, we would need to introduce some sort of thread checkpoint instead.

@headius kindly pointed out that we lack consideration of multiple
threads.  It is true that some other thread might change something
alongside, so we have to check the VM state right after we
context-switch.

Given we have the GVL, this should suffice.

	* thread.c (rb_threadptr_execute_interrupts): deoptimize, just in
	  case other threads might have changed something.

	* internal.h (rb_vm_global_timestamp): declared.

	* common.mk (thread.$(OBJEXT)): add dependency.
@shyouhei
Member Author

Pushed 2f7bfbf. It checks the VM timestamp every time a context switch happens.

@chrisseaton
Contributor

Another option could be to have any operation which will cause deoptimisation do the work of deoptimising the affected methods itself, before it transfers control back to them by releasing the GVL. Then the fast path (optimised methods) doesn't have to do a check. If they're still running, then they didn't have to deoptimise.

This is an advantage of the GVL. You shouldn't have to ever check if a method should deoptimise, if whoever causes the deoptimisation does the deoptimisation work itself. It only gets slightly more complicated with parallel threads, where you need to stop the world first.

@shyouhei
Member Author

@chrisseaton true.

However, we currently don't manage which method is optimized in which way, so we would have to start managing that part. Because methods are subject to GC, the management structure (if any) would need some sort of weak reference. I think this is a bit too large to include in this pull request.

There is another possible way to know which methods to deopt: scan all methods every time deoptimization is requested. I doubt this can be done in a timely fashion.

@headius
Contributor

headius commented Aug 29, 2016

@shyouhei Oh indeed, it should not be difficult to detect that there's been a deoptimization event across threads. The tricky bit will be deoptimizing the currently-executing iseqs without losing ipc, stack state, etc. I don't think that part exists yet, correct?

@shyouhei
Member Author

shyouhei commented Aug 29, 2016

@headius no. I mean, you are correct. That (avoiding touching VM states) is why we don't see considerable deopt overheads in the above benchmarks. @ko1 commented in person that I will need to juggle VM states sooner or later as I continue to develop, but not yet in this pull request. It is still just patching iseqs.

@NullVoxPopuli

Is there anything holding this back atm?

@jeremy

jeremy commented Sep 22, 2016

Builds on Apple LLVM version 8.0.0 (clang-800.0.40) with a small change (otherwise, a bus error on memcpy):

diff --git a/optimize.c b/optimize.c
index c0fee33..c9e666f 100644
--- a/optimize.c
+++ b/optimize.c
@@ -637,7 +637,7 @@ iseq_analyze(rb_iseq_t *iseq)
     }
 }

-static const VALUE wipeout_pattern[8]; /* maybe 5+2==7 should suffice? */
+static VALUE wipeout_pattern[8]; /* maybe 5+2==7 should suffice? */
 static VALUE adjuststack;
 static VALUE nop;
 static VALUE putobject;

@noahgibbs
Contributor

Is this moving forward? Is there anything we can do to help?

@jeremy

jeremy commented Sep 27, 2016

Is there anything we can do to help?

Some suggestions:

  1. Try to run it with your own app or other software.
  2. Try to break it, causing a crash or incorrect behavior.
  3. Benchmark it.
  4. Review and critique the idea and implementation.

@jamescook

I'm seeing errors when running rake's test suite:
rake_tests_output.txt

and errors when running bundle for rails:
rails_bundle_output.txt

@noahgibbs
Contributor

I get a different error on "make". I'm building with Clang on Mac.

deopt_make_errors.txt

I thought a cfp was created per method call and never reused, but that
turned out to be a wrong idea, because sometimes (as in the
vm_call_opt_send situation) vm_push_frame is not called.  We have to
explicitly check that the cfp is not clobbered.

	* vm_core.h (cfp_last_insn): new field, to hold iseq's encoded
	  body.

	* vm_insnhelper.h (PREPARE_FOR_ELIMINATION): store the encoded
	  body into last insn info.

	* vm_insnhelper.c (vm_eliminate_insn): check if the iseq in
	  question is the one stored earlier.

A cache entry can become stale on occasion.  That's natural, but the
problem is that the cache contents might include pointer(s) to
already-reclaimed memory regions.  By touching them we face SEGVs, so
we have to carefully avoid such situations.

	* optimize.c (cc_is_stale): check if the cache is fresh.

	* optimize.c (purity_of_cc): ditto.

	* vm.c (ruby_vm_global_method_state): disclose.

This prevents over-optimization.  A minimal example reproducing the
situation is:

```
% ruby --dump=insns -e '(true ? 0 : 1) + 1'
== disasm: #<ISeq:<main>@-e>============================================
0000 trace            1                                               (   1)
0002 putobject_OP_INT2FIX_O_0_C_
0003 jump             6
0005 putobject_OP_INT2FIX_O_1_C_
0006 putobject_OP_INT2FIX_O_1_C_
0007 opt_plus         <callinfo!mid:+, argc:1, ARGS_SIMPLE>, <callcache>
0010 leave
```

Here, the `jump 6` jumps into the middle of a putobject - putobject -
send sequence.  But because that sequence is subject to constant
folding, once it is replaced with `putobject 2`, address number 6 is
the literal 2.  So it breaks.

In order to prevent this, we now introduce a "purposeful" nop right
after a branch, and now it looks like:

```
% ./miniruby --dump=insns -e '(true ? 0 : 1) + 1'
== disasm: #<ISeq:<main>@-e>============================================
0000 trace            1                                               (   1)
0002 putobject_OP_INT2FIX_O_0_C_
0003 jump             6
0005 putobject_OP_INT2FIX_O_1_C_
0006 nopp
0007 putobject_OP_INT2FIX_O_1_C_
0008 opt_plus         <callinfo!mid:+, argc:1, ARGS_SIMPLE>, <callcache>
0011 leave
```

	* insns.def (nopp): new instruction.

	* compile.c (iseq_compile_each): insert the new nopp instruction
	  right after a NODE_IF compilation finishes, to prevent the
	  constant-folding situation.

	* optimize.c (purity_of_insn): take care of the new instruction.
@shyouhei
Member Author

OK, after close inspection I found several bugs and fixed them all. It now bundles and runs Rails 5.0.0.1, at least for me.

@NullVoxPopuli

How come the AppVeyor test results aren't public?

@noahgibbs
Contributor

Still doesn't compile on Mac with Clang. I have changes locally that allow it to build, but it segfaults. I've attached the "git diff". I'll try to get it to build and run successfully.
deopt_mac_patch.txt

@shyouhei
Member Author

Hmm. Thank you. I've not tested this against clang so far. Will take a look.

Before this, we prepared every iseq.  Koichi Sasada pointed out this
could be a problem when there are lots of non-optimizable sequences.
In order to reroute this, we delay allocation of deoptimized sequences
until we actually touch the sequence.

	* iseq.c (rb_iseq_new_with_opt): delay allocating deoptimized
	  sequence.

	* optimize.c (iseq_squash): prepare just in case this is the first
	  time we optimize this target.

	* optimize.c (iseq_eager_optimize): ditto.

	* deoptimize.h (iseq_deoptimize_if_needed): consideration of
	  virgin iseqs.
@rye

rye commented Dec 8, 2016

Here is the Make output when I compile using clang.

EDIT: Linked to a gist rather than dumping all the output in here.

@claudiug

Any status on this one? It's almost one year old :)

@mensfeld
Contributor

Interested in this one as well. Any news?

@nijikon

nijikon commented Jun 26, 2017

Me also.

@marysieek

@pedroaliens @shyouhei any update on this issue?

@benlovell

Folks, piling on doesn't help expedite an issue.

@shyouhei
Member Author

Hi, a quick update: it's still work in progress. I got comments from other core devs including @ko1, so I think I need his review to merge this once I finish a peaceful update.

Thank you for your interest!

@NullVoxPopuli

NullVoxPopuli commented Aug 4, 2017

I successfully made and installed this, but when I tried running bundle install, I got a segfault:

$ sudo bundle install # happens with and without sudo, though
Don't run Bundler as root. Bundler can ask for sudo if it is needed, and installing your
bundle as root will break this application for all non-root users on this machine.
Fetching gem metadata from https://rubygems.org/./usr/local/lib/ruby/2.4.0/net/http.rb:939: [BUG] Segmentation fault at 0x00000000000010
ruby 2.4.0dev (2016-07-22 deoptimization 55727) [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0023 p:---- s:0138 e:000137 CFUNC  :wait_readable
c:0022 p:0701 s:0134 E:002290 METHOD /usr/local/lib/ruby/2.4.0/net/http.rb:939
c:0021 p:0008 s:0121 e:000120 METHOD /usr/local/lib/ruby/2.4.0/net/http.rb:869
c:0020 p:0076 s:0118 e:000117 METHOD /usr/local/lib/ruby/2.4.0/net/http.rb:864
c:0019 p:0045 s:0115 e:000114 METHOD /usr/local/lib/ruby/gems/2.4.0/gems/bundler-1.15.3/lib/bundler/vendor/net-http-persistent/lib/net/http/persistent.rb:702
c:0018 p:0530 s:0110 E:001a50 METHOD /usr/local/lib/ruby/gems/2.4.0/gems/bundler-1.15.3/lib/bundler/vendor/net-http-persistent/lib/net/http/persistent.rb:633
c:0017 p:0041 s:0097 e:000096 METHOD /usr/local/lib/ruby/gems/2.4.0/gems/bundler-1.15.3/lib/bundler/vendor/net-http-persistent/lib/net/http/persistent.rb:996
c:0016 p:0148 s:0084 e:000083 METHOD /usr/local/lib/ruby/gems/2.4.0/gems/bundler-1.15.3/lib/bundler/fetcher/downloader.rb:50
c:0015 p:0048 s:0075 e:000074 METHOD /usr/local/lib/ruby/gems/2.4.0/gems/bundler-1.15.3/lib/bundler/fetcher/downloader.rb:16
c:0014 p:0028 s:0067 e:000066 METHOD /usr/local/lib/ruby/gems/2.4.0/gems/bundler-1.15.3/lib/bundler/fetcher/compact_index.rb:116
c:0013 p:0161 s:0061 e:000060 BLOCK  /usr/local/lib/ruby/gems/2.4.0/gems/bundler-1.15.3/lib/bundler/compact_index_client/updater.rb:50
c:0012 p:0048 s:0054 e:000053 METHOD /usr/local/lib/ruby/2.4.0/tmpdir.rb:89
c:0011 p:0026 s:0047 e:000046 METHOD /usr/local/lib/ruby/gems/2.4.0/gems/bundler-1.15.3/lib/bundler/compact_index_client/updater.rb:30
c:0010 p:0068 s:0040 e:000039 METHOD /usr/local/lib/ruby/gems/2.4.0/gems/bundler-1.15.3/lib/bundler/compact_index_client.rb:81
c:0009 p:0151 s:0035 e:000034 METHOD /usr/local/lib/ruby/gems/2.4.0/gems/bundler-1.15.3/lib/bundler/compact_index_client.rb:97

@claudiug

Hello, now this PR is here for 2 years. Any news?

@shyouhei
Member Author

This pull request was a bit too big for the other core devs to review, so I changed my strategy for pushing things forward. As you see above, several parts of this pull request were separated out and proposed individually. Some of them have already been merged, gaining ~1% speedup.

Right now I am preparing another single-cut branch which brings some more boost. It is yet to be pull-requested because it fails some tests right now. I hope it can be submitted soon.

@shyouhei shyouhei mentioned this pull request Mar 19, 2019
@k0kubun k0kubun changed the base branch from trunk to master August 15, 2019 17:37
@shyouhei
Member Author

shyouhei commented Sep 9, 2019

As I wrote above, we are not going to merge this pull request as-is. So I am closing it now.
