Deoptimization Engine #1419

Closed
wants to merge 23 commits into from

Conversation

shyouhei
Member

Abstract

Implemented a way to optimize ruby's executions and to revert those optimizations. The strategy is restricted so that no VM states, such as program counter(s), are affected by the modifications. This restriction makes deoptimization much more lightweight. Experiments show that optimizations on this mechanism speed up micro-benchmarks, yet add no significant overhead to pitfall-ish activities like eval.

Introduction

Ruby is said to be slow[1]. While there can be arguments about this, experiments show that it is at least not the fastest language to execute[2]. …

The reason behind this is variously attributed to its GC being slow, or its GVL, or its dynamic nature. But the author would like to point out the most important thing that Ruby lacks: it does not even try to optimize its execution. We believe this is the root cause of the slowness we suffer.

We wrote "try to" above. This is the essential concept of this pull request. The reason Ruby has not optimized so far is that its execution might differ on occasion. For instance a 1 + 2 might not always be 3, because + can be redefined on-the-fly. So the interpreter does not constant-fold the expression and looks up Integer#+ every time.

Such redefinition, however, rarely happens. Almost every time 1 + 2 is evaluated, its value is arguably 3. Given this empirical assessment, it makes sense for Ruby to first try 3, and if Integer#+ is redefined for any reason, fall back to the lookup. This should be much faster for the vast majority of normal (not redefined) cases, while redefinition still works, at the cost of some overhead.

This sort of technique is called deoptimization[3]. Deoptimization is done, more or less, in other languages and implementations. Notably, JVM-related projects have much research in this area. For instance JRuby+Truffle already has a rich application of this technique[4].

What is pull-requested here is not a direct C translation of JRuby+Truffle's approach. Let me describe what is going on.

The deoptimization strategy

Since 1.9, Ruby's interpreter has been equipped with a language VM (visible as ::RubyVM from inside ruby scripts). It first generates sequences ("iseq"s hereafter) of VM instructions ("insn"s hereafter), then executes them in series. The key structure of this mechanism is called struct rb_iseq_constant_body.

struct rb_iseq_constant_body {
    (…snip…)

    unsigned int iseq_size;
    const VALUE *iseq_encoded; /* encoded iseq (insn addr and operands) */

    (…snip…)
};

The iseq_encoded field is what the VM scans to execute. This is where we want to optimize. Prior to making any modifications, we hold a backup of it using memcpy. In case of a deoptimization event, we just overwrite this memory region with the backed-up sequence, again using memcpy, to revert whatever optimizations might have been applied to it.
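
As a minimal sketch of the idea (the deoptimize field and the function names here are hypothetical placeholders; the actual patch implements this in deoptimize.c / deoptimize.h):

```
#include <stdlib.h>
#include <string.h>

/* Assumed: struct rb_iseq_constant_body grows a field
 *     const VALUE *deoptimize;    -- pristine copy of iseq_encoded */

static void
iseq_prepare_deoptimization(rb_iseq_t *iseq)
{
    struct rb_iseq_constant_body *body = iseq->body;
    size_t size = sizeof(VALUE) * body->iseq_size;
    VALUE *backup = (VALUE *)malloc(size);

    memcpy(backup, body->iseq_encoded, size); /* save the pristine sequence */
    body->deoptimize = backup;
}

static void
iseq_deoptimize(rb_iseq_t *iseq)
{
    struct rb_iseq_constant_body *body = iseq->body;

    /* revert in place; no program counter or other VM state is touched */
    memcpy((VALUE *)body->iseq_encoded, body->deoptimize,
           sizeof(VALUE) * body->iseq_size);
}
```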

The upside of this strategy is that it is fairly simple. Because everything is written in pure C we have no portability concerns. It is also so simple that the deoptimization does not involve the VM states at all, including program counters. This is an advantage when execution of instructions recurs; if we had to update program counters, we would have to scan the whole VM stack to keep consistency between before and after the deoptimization process.

The downside of this strategy is that it restricts what can be optimized. For instance we cannot reorder instructions, because exceptions might be raised inside a reordered sequence. We cannot change the length of an iseq, because the underlying memory region cannot be reallocated.

However, even with these restrictions, we can do something.

Implemented optimizations

We have the following optimizations in this pull request.

Constant folding

For instance 1 + 2 is optimized into 3 like this.

--- /dev/shm/xj79gt      2016-08-17 17:09:31.000000000 +0900
+++ /dev/shm/1gaaeo      2016-08-17 17:09:31.000000000 +0900
@@ -20,8 +20,10 @@ local table (size: 2, argc: 0 [opts: 0,
 |------------------------------------------------------------------------
 0000 trace            256                                             (   7)
 0002 trace            1                                               (   8)
-0004 putobject_OP_INT2FIX_O_1_C_
-0005 putobject        2
-0007 opt_plus         <callinfo!mid:+, argc:1, ARGS_SIMPLE>, <callcache>
+0004 putobject        3
+0006 nop
+0007 nop
+0008 nop
+0009 nop
 0010 trace            512                                             (   9)
 0012 leave                                                            (   8)

The nops are introduced to meet the "no PC modification" restriction. In order to implement this, the opt_plus instruction is modified to detect a putobject-putobject-opt_plus sequence. It then does the computation as usual, and overwrites itself with a sequence like the above diff.
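
A sketch of that rewrite (encoded_insn() is a hypothetical helper returning the encoded form of an insn; with direct threading these are label addresses rather than enum values). The slot layout matches the diff above:

```
/* pc points at the 6-slot window
 *   putobject obj1 | putobject obj2 | opt_plus ci cc   (1+2+3 slots)
 * which is rewritten, with instruction boundaries preserved, into
 *   putobject sum | nop | nop | nop | nop              (2+1+1+1+1 slots) */
static void
constfold_plus(VALUE *pc, VALUE sum)
{
    int i;

    pc[0] = encoded_insn(BIN(putobject)); /* hypothetical encoder */
    pc[1] = sum;
    for (i = 2; i < 6; i++) {
        pc[i] = encoded_insn(BIN(nop));
    }
}
```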

Ruby-level constants (like ::Object) are also subject to folding. They are resolved and replaced with a putobject of the identical value.

send elimination

A typical ruby script consists of many method invocations. They are represented as send or its variant instructions in an iseq. If they are optimized, the effect should be far from marginal. The problem is that a method call is not always optimizable. We have to determine whether we can skip the call.

In order to do so, a method is called as usual the first time, and checked for whether it has any side effects. If a method does not contain any side effects, it (more precisely, its underlying iseq) is marked as being "pure". Next time the same method is called, we can consider eliminating the call to it.

However, a method that is not marked pure is not always ineligible for optimization. A method can of course call another method, and a method that is not marked pure may simply not have been checked yet. So a method has 3 states, namely "pure", "not pure", and "unpredictable". Every method starts its life as unpredictable. Once the methods it calls are all predicted to be pure, and it has no side effects itself, its state is updated to pure.

enum insn_purity {
    insn_is_pure          = Qtrue,  /* known to be free of side effects */
    insn_is_not_pure      = Qfalse, /* known to have side effects */
    insn_is_unpredictable = Qnil,   /* not analyzed yet, or depends on callees */
};

Side effects that we care about include the following (a sketch of such a classifier appears after the list):

  • Writes to variables other than local ones,
  • Definition of classes,
  • Calls to methods that are not pure,
  • Calls to C functions,
    • C functions cannot be inspected, so we conservatively treat them all as NG.
  • Yielding blocks.
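
As a hypothetical sketch of such a classifier (the function itself is an assumption about the real analyzer in optimize.c; BIN() and the insn names are CRuby's, and the default arm is oversimplified here):

```
static enum insn_purity
purity_of_insn(int insn)
{
    switch (insn) {
      case BIN(setglobal):            /* writes a non-local variable */
      case BIN(setinstancevariable):
      case BIN(setclassvariable):
      case BIN(setconstant):
      case BIN(defineclass):          /* defines a class */
      case BIN(invokeblock):          /* yields a block */
      case BIN(opt_call_c_function):  /* C functions cannot be inspected */
        return insn_is_not_pure;

      case BIN(send):                 /* purity depends on the callee */
      case BIN(opt_send_without_block):
      case BIN(invokesuper):
        return insn_is_unpredictable;

      default:                        /* putobject, getlocal, nop, ... */
        return insn_is_pure;
    }
}
```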

Once we know a method is pure, it is safe to eliminate a call to that method whose return value(s) are immediately discarded. Note however that even in such situations, evaluation of the method arguments is not always eligible for elimination; the arguments can have their own side effects. We just eliminate the send variant instruction, not its call-site setup, like the following diff:

--- /dev/shm/165rrgd     2016-08-17 10:44:10.000000000 +0900
+++ /dev/shm/jd0rcj      2016-08-17 10:44:10.000000000 +0900
@@ -23,8 +23,10 @@ local table (size: 2, argc: 0 [opts: 0,
 0004 putself
 0005 putself
 0006 opt_send_without_block <callinfo!mid:n, argc:0, FCALL|VCALL|ARGS_SIMPLE>, <callcache>
-0009 opt_send_without_block <callinfo!mid:m, argc:1, FCALL|ARGS_SIMPLE>, <callcache>
-0012 adjuststack      1
+0009 adjuststack      2
+0011 nop
+0012 nop
+0013 nop
 0014 trace            1                                               (  16)
 0016 putnil
 0017 trace            512                                             (  17)

This is how m(n()) is optimized, where method m is pure but n is not.

Elimination of variable assignments

As we saw, a method call cannot be eliminated when its return value is used. To relax this, we would like to skip assignments to variables where possible.

However, this is not easy to do precisely. A variable's liveness must be analyzed, and all types of variables except local ones are nearly impossible to analyze. Also, because the deoptimization process restricts what is allowed here, massive iseq rearrangements like SSA conversion are not doable.

In this pull request a very limited version of variable liveness analysis is implemented. Only local variables which are never referenced (assigned, then immediately made garbage) are detected, then eliminated if possible. This is not the only thing that could be done theoretically, but it is practically the easiest to detect, so we do this as a bridgehead to further optimizations.

Note that we have to check iseqs recursively, because local variables are reachable from inside a block, and a block is represented as a separate iseq. This means we have to check all iseqs that can possibly reach a local variable to ensure no references to it exist.
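
A sketch of the check under these assumptions (rb_vm_insn_addr2insn() is assumed to decode a direct-threaded address back into an insn number; operand-unified getlocal variants and the recursion into child iseqs are elided):

```
static int
local_is_writeonly(const rb_iseq_t *iseq, VALUE idx)
{
    const struct rb_iseq_constant_body *body = iseq->body;
    unsigned int pc = 0;

    while (pc < body->iseq_size) {
        const VALUE *code = body->iseq_encoded;
        int insn = rb_vm_insn_addr2insn((const void *)code[pc]);

        if (insn == BIN(getlocal) && code[pc + 1] == idx) {
            return 0;           /* the slot is read somewhere */
        }
        pc += insn_len(insn);   /* skip over this insn's operands */
    }
    /* A complete check must also recurse into every child (block)
     * iseq that can see this local; a read from inside a block
     * counts as a reference. */
    return 1;
}
```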

After checking for usages, a useless assignment can be eliminated like this:

--- /dev/shm/1p1dusf     2016-08-25 18:06:58.000000000 +0900
+++ /dev/shm/1kpw0qe     2016-08-25 18:06:58.000000000 +0900
@@ -17,10 +17,14 @@ local table (size: 3, argc: 1 [opts: 0,
 [ 3] i<Arg>     [ 2] x
 0000 trace            256                                             (  12)
 0002 trace            1                                               (  13)
-0004 putself
-0005 getlocal_OP__WC__0 3
-0007 opt_send_without_block <callinfo!mid:m, argc:1, FCALL|ARGS_SIMPLE>, <callcache>
-0010 setlocal_OP__WC__0 2
+0004 nop
+0005 nop
+0006 nop
+0007 nop
+0008 nop
+0009 nop
+0010 nop
+0011 nop
 0012 trace            1                                               (  14)
 0014 putnil
 0015 trace            512                                             (  15)

Experiments

To determine the effectiveness of this approach we ran experiments in my environment. The machine is equipped with an Intel(R) Core(TM)2 Duo T7700 CPU and runs Ubuntu 16.04. The image below is a make benchmark result against ruby 2.4.0dev.

Looking at the benchmark, the results are (generally speaking) similar to 2.4, with several exceptional cases that speed up. The most improved, vm2_bigarray*, got a huge speedup because the generation of a big array is eliminated entirely. The other improved benchmarks show that the optimizations described above work quite well.

It is worth noting that the vm2_eval* result is 0.933, i.e. eval is only 2.7% slower compared to no optimization. This can be considered the deoptimization preparation cost, and it means the overhead is quite small.

Conclusions

Implemented a deoptimization engine for ruby and several optimizations on top of it. The proposed deoptimization strategy is lightweight, so that activities like eval work without huge overhead.

The author does not argue that this is the best optimization technique we know of. Rather, even if it is suboptimal, the benchmark results show that the optimizations proposed here gain a considerable amount of speedup. This is why ruby is said to be slow: it is not optimized at all, so even trivial optimizations significantly impact it.

We believe ruby should have optimizations implemented.

Future works

More optimizations can be built on top of this deoptimization engine. For instance, branches with constant expressions could be relaxed to unconditional jumps. More analysis of variable liveness could eliminate more assignments. Going in the other direction, we could think of adding new local variables, for example to eliminate common subexpressions.

The deoptimization strategy itself is subject to improvement. If we can tolerate more slowdown of eval, a more fine-grained setup like three-address code is conceivable. To do so, the deoptimization engine would have to be able to revert the IR conversion.

To suppress warnings, add an attribute if possible.

	* template/insns_info.inc.tmpl (insn_len): now that this file is
	  included from more than one file, chances are that these static
	  functions can be left unused depending on the compilation unit.
	  In such situations, in order to prevent compiler warnings, we
	  annotate this function as possibly unused.

	* template/insns_info.inc.tmpl (insn_name): ditto.

	* template/insns_info.inc.tmpl (insn_op_type): ditto.

	* template/insns_info.inc.tmpl (insn_op_types): ditto.

	* configure.in (MAYBE_UNUSED): check if unused attribute is
	  supported by the compiler.

This is important because signed operations shall not overflow in C.
If you want roundtrip behavior you can just cast to unsigned then cast
back, but if you want to avoid overflows completely, I believe this is
the fastest method.

	* internal.h (SATURATION_ADD): new macro that does addition of two
	  variables in saturated integer arithmetic.

	* configure.in: check for __builtin_add_overflow compiler
	  intrinsic which is available in recent gcc and clang, with
	  expectedly-optimal implementation of overflow detection.
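
A sketch of what such a macro might look like (the exact definition in internal.h may differ, and the HAVE_BUILTIN_ADD_OVERFLOW guard name is an assumption about what configure generates; the fallback assumes non-negative operands):

```
#ifdef HAVE_BUILTIN_ADD_OVERFLOW
/* GCC/Clang intrinsic: reports overflow without ever performing the
 * overflowing signed addition itself. */
#define SATURATION_ADD(max, a, b) __extension__ ({ \
    __typeof__(a) sum_; \
    __builtin_add_overflow((a), (b), &sum_) ? (max) : sum_; \
})
#else
/* Portable fallback: compare before adding so the signed addition can
 * never overflow (assumes 0 <= (b) and (a) <= (max)). */
#define SATURATION_ADD(max, a, b) \
    (((a) > (max) - (b)) ? (max) : ((a) + (b)))
#endif
```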

This is handy when the current insn is auto-generated (e.g. operand
unification).  This by itself has no runtime overheads.

	* tool/instruction.rb (RubyVM::Instruction#sp_increase_macro_expr):
	  new method, to define a macro inside of each insn.

	* tool/instruction.rb (RubyVM::VmBodyGenerator#make_header_defines):
	  a bunch of new macros to define, which are (mostly) compile-time
	  constants that characterize many of the instructions, such as
	  their names.

	* tool/instruction.rb (RubyVM::VmBodyGenerator#make_footer_undefs):
	  clean up the new macros.

The struct rb_id_table can hold arbitrary VALUE values.  Now that this
struct is reachable from Ruby's object space, it must understand what
GC requests of it.

	* id_table.c (rb_id_table_mark): new function to avoid GC leak for
	  id tables.  This function applies against all implementations.

	* id_table.h (rb_id_table_mark): ditto.

This struct is an ID-to-VALUE mapping, which must be convertible to a
Hash.

	* id_table.h (rb_id_table_to_h): new function to convert a struct
	  rb_id_table to a Hash.  Expected to be used for inspection
	  purpose from inside a ruby script.

	* id_table.c (rb_id_table_to_h): ditto.

This implements arbitrary key-value pairs on each ISeq, meant to be
used when optimizing.  Much like C++11's generalized attributes.

Note that, as this is meant for optimization purposes, you can't
modify the attributes from a ruby script.  The ruby-level API is for
debugging purposes.

	* vm_core.h (rb_iseq_constant_body): new attributes field to hold
	  arbitrary key-value for any ISeq.

	* iseq.c (rb_iseq_annotate): ISeq attribute setter.  This is
	  intentionally not exposed to 3rd parties now, to prevent
	  potentially malicious usage of this feature.

	* iseq.c (rb_iseq_annotated_p): obtain specific attribute of an
	  ISeq.

	* iseq.c (iseqw_attributes): obtain attributes hash of an ISeq.

This variable is expected to be of an integer type which can be
incremented atomically.  It is expected to be used where a certain
object's "freshness" is vital, e.g. when invalidating a cache.

	* vm.c (ruby_vm_global_timestamp): new variable to hold per-VM
	  timestamp counter.

	* vm_insnhelper.h (INC_GLOBAL_TIMESTAMP): macro to increase timestamp.

	* vm_method.c (rb_clear_constant_cache): increase timestamp.

	* vm_method.c (rb_clear_method_cache_by_class): ditto.

	* vm.c (rb_next_class_serial): ditto.

This commit adds a new file named deoptimize.c, which basically saves
the yet-to-be-optimized instruction sequence aside, and reverts to
that state on occasion.

The "state" here means the pair of the ISeq's raw sequence and the
program counter.  Put differently, the "optimization" assumes that
everything other than those two is shared between the optimized and
non-optimized variants.  Examples of such shared info include the
catch table; so under this infrastructure you cannot yet optimize an
iseq which deals with exceptions.  That is to be developed later.

There is no corresponding on-the-fly optimizer yet; this commit only
adds overhead.  If you take a benchmark on this commit it should be
(hopefully slightly) slower than the original.  Any optimization that
utilizes this mechanism must be sufficiently faster than the overhead
introduced in this commit.

	* deoptimize.h: new file.

	* deoptimize.c: new file.

	* common.mk (COMMONOBJS): dependencies for new files.

	* vm_core.h (rb_iseq_constant_body): new field.

	* iseq.c (rb_iseq_free): proper handling of new field.

	* iseq.c (iseq_memsize): ditto.

	* vm_insnhelper.c (vm_check_iseq_freshness): new function to check
	  if given iseq is stale; if so, deoptimization is kicked.

	* vm_insnhelper.c (vm_push_frame): called from here right at the
	  beginning of an iseq execution.

	* vm_insnhelper.h (CALL_METHOD): also, a method call can result in
	  redefinition(s) of methods, like by requiring something.  At a
	  graceful return from a method we shall check freshness of
	  current iseq too.

	* vm.c (rb_vm_global_timestamp): export this variable.

	* iseq.c (rb_iseq_new_with_opt): prepare deoptimization.
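
Combined with the global timestamp above, the freshness check can be sketched like this (a minimal sketch; the optimized_at field name is an assumption):

```
/* Called when a frame is pushed and at graceful method return.  If
 * anything changed since this iseq was optimized (method/constant
 * cache invalidations bump ruby_vm_global_timestamp), throw the
 * optimized sequence away before executing it any further. */
static inline void
vm_check_iseq_freshness(rb_iseq_t *iseq)
{
    if (iseq->body->optimized_at != ruby_vm_global_timestamp) {
        iseq_deoptimize(iseq);  /* memcpy the pristine copy back */
        iseq->body->optimized_at = ruby_vm_global_timestamp;
    }
}
```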

This commit adds an on-the-fly ISeq analyzer.  It detects an ISeq's
purity, i.e. whether that ISeq has side effects or not.  Purity is
the key concept of optimization techniques in general, but in Ruby it
is even more important because there is a method called eval.  A pure
ISeq is free from eval, while those that are not pure are stuck in the
limbo where any of their side effects _could_ result in a (possibly
aliased) call to eval.  So optimizations tend not to be possible
against them.

Note however, that the analyzer cannot always statically say whether
the ISeq in question is pure or not.  It categorizes an ISeq into 3
states, namely pure, not pure, or "unpredictable".  The last category
is used when, for instance, there are branches yet to be analyzed, or
method calls to another unpredictable ISeq.

An ISeq's purity changes over time, not only by redefinition of
methods, but by other optimizations; for example, entering a
rarely-taken branch of a formerly-unpredictable ISeq kicks the
analyzer to fix its purity.  Such changes propagate to the callers.

	* optimize.c: new file.

	* optimize.h: new file.

	* common.mk (COMMONOBJS): dependencies for new files.

	* iseq.h (ISEQ_NEEDS_ANALYZE): new flag to denote the iseq in
	  question might need (re)analyzing.

Introduces a mechanism to replace a pair of send + pop instructions
with the strength-reduced equivalent of adjuststack + nop(s), when
that transformation is guaranteed possible.  When some method
redefinition happens to break that assumption, it falls back to the
original sequence.

	* insns.def (adjuststack): kind of "eats" the immediately preceding
	  instruction, moving itself into its place.  This is the key part
	  of the whole optimization we implement in this commit.

	* insns.def (send): now subject to elimination.  When there is a
	  sequence of "send then pop", and the send does not include any
	  side effects, we can safely ignore those two instructions.

	* insns.def (opt_send_without_block): ditto.

	* insns.def (invokesuper): ditto.

	* insns.def (putnil): this insn is already handled in
	  iseq_peephole_optimization() before execution.  But as the
	  optimization process goes on during runtime, additional places
	  where this insn can be considered may appear.  We would like to
	  wipe those out as well.

	* insns.def (putself): ditto.

	* insns.def (putobject): ditto.

	* insns.def (putstring): ditto.

	* insns.def (concatstrings): ditto.

	* insns.def (freezestring): ditto.

	* insns.def (toregexp): ditto.

	* insns.def (newarray): ditto.

	* insns.def (duparray): ditto.

	* insns.def (newhash): ditto.

	* insns.def (newrange): ditto.

	* compile.c (iseq_peephole_optimize): replace pop with
	  adjuststack 1; the possibility is that this instruction can "eat"
	  the immediately preceding instruction to increase its operand.

	* compile.c (REPLACE_ELEM): resurrect, because it now has a
	  client.

	* vm.c: add necessary header files.

	* common.mk (vm.$(OBJEXT)): update to follow new dependencies.

	* optimize.c (wipeout_pattern): pattern buffer, filled with nops.
	  Technically it is interesting because the pattern filled in is a
	  pointer to a function-static variable of another function; it
	  breaks variable visibility.

	* optimize.c (construct_pattern): constructor function to fill in
	  static pattern.

	* optimize.c (iseq_eliminate_insn): fill the pattern to patch the
	  memory region the program counter currently points to.  This is
	  where the actual "optimization" takes place.

	* vm_core.h (cfp_last_insn): latest instruction's place, length,
	  and stack increase.  Note that it does not hold a direct
	  pointer, because deoptimization can change the instruction
	  sequence between the creation of this field and its use.

	* vm_core.h (rb_call_cache): new field to hold the "temperature"
	  of the call site this cache represents.  This field is
	  explicitly declared signed because we need sub-zero temperatures.

	* vm_insnhelper.h (PREPARE_FOR_ELIMINATION): remember the
	  instruction currently running to CFP, for possible later
	  elimination.

	* vm_insnhelper.h (PREPARE_ELIMINATE_SENDISH): a variant, which
	  remembers the current instruction only when it is hot.

	* vm_insnhelper.c (vm_eliminate_insn): because this optimization
	  routine resides in a super duper hot path it is worth inlining
	  some checks before actual function call is issued.  This
	  function is placed here for that reason.

	* vm_insnhelper.c (vm_is_hot): check if it is safe to optimize the
	  given caller site.  Note that the safety is tested against what
	  is about to be called, not the currently running ISeq.

	* deoptimize.c (iseq_deoptimize): reset temperatures of each call
	  caches on deoptimization.

	* vm_insnhelper.h (inc_temperature): new macro to do a saturated ++

There are cases where a push push push ... pop pop pop sort of
sequence appears.  If the middle of such a sequence is optimized, a
nop and an adjuststack can become adjacent.  In such situations,
moving the adjuststack forward to leave the nop behind can trigger
further optimization.

	* optimize.c (iseq_move_nop):  swap "nop adjuststack" sequence
	  to make room for optimization.

	* vm_insnhelper.c (vm_move_nop): ditto.

	* insns.def (adjuststack): swap nop if possible.
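
A sketch of the swap (a nop occupies 1 slot, adjuststack 2 slots: the insn plus its operand; the encoded_insn() helper is again hypothetical):

```
/* Turn  [nop][adjuststack n]  into  [adjuststack n][nop]  without
 * changing the total length, so program counters stay valid. */
static void
move_nop(VALUE *pc)
{
    VALUE insn    = pc[1];  /* adjuststack, in encoded form */
    VALUE operand = pc[2];  /* its stack-adjust count n */

    pc[0] = insn;
    pc[1] = operand;
    pc[2] = encoded_insn(BIN(nop));  /* hypothetical encoder */
}
```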

This is a very limited version of variable liveness analysis.  Only
local variables which are never referenced (assigned, then immediately
made garbage) are detected, and eliminated.

One _could_ do more if the analysis happened much earlier than it does
now.  The current approach runs after a considerable amount of
execution has already been done using the compiled sequence.  But
analyzing earlier would impact the execution of evil activities like
eval.  That is not acceptable as of today, so what is possible is very
limited.

	* insns.def (setlocal): introduces the new concept of per-iseq
	  temperature.  If an ISeq seems to be executed many times (or for
	  a long time), trigger iseq analysis.

	* insns.def (getlocal): now subject to elimination.

	* insns.def (duparray): ditto.

	* insns.def (leave): analyze the leaving iseq.

	* optimize.c (iseq_eager_optimize): optimize an ISeq as a whole.
	  Some optimizations can be possible right after it was analyzed
	  and this function is here to do that.

	* optimize.c (iseq_analyze): now checks for local variable's
	  (very restricted) liveness.  Can be used for optimizations.

	* optimize.c (iseq_analyze_i): rename from iseq_purity, and
	  recursively checks for child ISeq.

	* optimize.h (iseq_local_variable_is_writeonly): utility function
	  to check if a local variable at given index is write only.

	* vm_core.h (rb_iseq_constant_body): new field to hold per-iseq
	  temperature.

	* iseq.h (ISEQ_EAGER_OPTIMIZED): new ISeq flag, denoting whether
	  the ISeq has experienced eager optimization.

	* deoptimize.c (iseq_deoptimize): reset new fields and flags on
	  deoptimize.

This is simply replacing an access to a constant with an immediate.
This has been possible before; the reason we didn't is that a constant
can (in spite of its constant-ness) be reassigned.  Now that we can
revert such optimizations, why not optimize.

	* insns.def (getinlinecache): the inline cache is already managing
	  constant validity, so let's just step forward.  Replace itself
	  with putobject.

	* optimize.c (iseq_const_fold):

	* optimize.c (iseq_squash): it is the third time I have written
	  this memcpy, so I think it's time to refactor it into a function.

	* optimize.c (iseq_move_nop): use iseq_squash().

	* optimize.c (iseq_eliminate_insn): ditto.

Integer binary operations are purely mathematical by default.  They
can be computed beforehand.  Here, for each such operation, we check
the immediately sibling instructions and, if it seems to be the case,
replace them with the computed value that sits on the stack top.

	* optimize.c (iseq_move_nop): generalized so that any instruction
	  can move around.

	* vm_insnhelper.c (vm_move_nop): ditto.

	* vm_insnhelper.h (MOVE_NOP): ditto.

	* insns.def (putobject): subject to move around.

	* insns.def (adjuststack): ditto.

	* vm_core.h (rb_control_frame_struct): new field counting the
	  number of continuous putobject instructions.  If this value is
	  sufficiently large, we can consider const-folding them.

	* tool/instruction.rb (RubyVM::VmBodyGenerator#make_footer):
	  set/reset the new field, depending on instructions.  This
	  assigned memory region is used much much later so the assignment
	  is mostly freely reorderable.

	* insns.def (opt_constfold): new (quasi-)instruction that is never
	  reached when not optimized.  Its purpose is to detect whether
	  constant folding is possible and, if so, replace the previous
	  instructions with a single putobject.

	* insns.def (opt_plus): subject to const fold.

	* insns.def (opt_minus): ditto.

	* insns.def (opt_mult): ditto.

	* insns.def (opt_div): ditto.

	* insns.def (opt_mod): ditto.

	* insns.def (opt_eq): ditto.

	* insns.def (opt_neq): ditto.

	* insns.def (opt_lt): ditto.

	* insns.def (opt_le): ditto.

	* insns.def (opt_gt): ditto.

	* insns.def (opt_ge): ditto.
@shyouhei changed the title from Deoptimization to Deoptimization Engine on Aug 26, 2016
@headius
Contributor

headius commented Aug 26, 2016

I read through this a bit and have a question: can it deopt during a method body? If I'm reading it right, the deopt can only occur immediately before entry into the method, which means off-thread changes won't be seen until you re-enter the optimized method again. Is that correct?

@shyouhei
Member Author

shyouhei commented Aug 27, 2016

@headius very good point. My answer is yes and no. In this patch deoptimization can occur at two points:

  • at the very beginning of a method, during construction of stack frame: 5231208#diff-347392b62c9edab3c8ab5fb0d71a012fR196
    • This is expected to purge a stale optimized sequence.
  • right after a method call returns. 5231208#diff-d52fa5b396928e0544140fd48f329543R126
    • Redefinitions ultimately happen inside some C function defined elsewhere. From a ruby script's point of view, calling a C function is done via calling a method. So every time we call a method, there is a chance of redefinition. We have to make sure.

In short, inter-thread method tampering is not checked explicitly. There are chances for deoptimization during a method's execution, but not immediately when another thread changes something.

I think we can detect off-thread changes. For now we have the GVL, so luckily we only need to check right after we acquire the GVL. That should suffice for now. If we decide to give up the GVL, we would need to introduce some sort of thread checkpoint instead.

@headius kindly pointed out that we lack consideration of multiple
threads.  It is true that some other thread might change something
alongside, so we have to check the VM state right after we
context-switch.

Given we have the GVL, this should suffice.

	* thread.c (rb_threadptr_execute_interrupts): deoptimize, just in
	  case other threads might have changed something.

	* internal.h (rb_vm_global_timestamp): declared.

	* common.mk (thread.$(OBJEXT)): add dependency.
@shyouhei
Member Author

Pushed 2f7bfbf. It checks the VM timestamp every time a context switch happens.

@chrisseaton
Contributor

Another option could be to have any operation which will cause deoptimisation do the work of deoptimising the affected methods itself, before it transfers control back to them by releasing the GVL. Then the fast path (optimised methods) doesn't have to do a check. If they're still running, then they didn't have to deoptimise.

This is an advantage of the GVL. You shouldn't have to ever check if a method should deoptimise, if whoever causes the deoptimisation does the deoptimisation work itself. It only gets slightly more complicated with parallel threads, where you need to stop the world first.

@shyouhei
Member Author

@chrisseaton true.

However, we currently don't manage which method is optimized in which way, so we would have to start managing that part. Because methods are subject to GC, the management structure (if any) would need some sort of weak reference. I think this is a bit too large to include in this pull request.

There is another possible way to know which methods to deopt: scan all methods every time deoptimization is requested. I doubt this can be done in a timely fashion.

@headius
Contributor

headius commented Aug 29, 2016

@shyouhei Oh indeed, it should not be difficult to detect that there's been a deoptimization event across threads. The tricky bit will be deoptimizing the currently-executing iseqs without losing ipc, stack state, etc. I don't think that part exists yet, correct?

@shyouhei
Member Author

shyouhei commented Aug 29, 2016

@headius no. I mean, you are correct. That (avoiding touching VM states) is why we don't see considerable deopt overheads in the above benchmarks. @ko1 commented in person that I will need to juggle VM states sooner or later as I continue to develop, but not yet in this pull request. It is still just patching iseqs.

@NullVoxPopuli

Is there anything holding this back atm?

@jeremy

jeremy commented Sep 22, 2016

Builds on Apple LLVM version 8.0.0 (clang-800.0.40) with a small change (otherwise, a bus error on memcpy):

diff --git a/optimize.c b/optimize.c
index c0fee33..c9e666f 100644
--- a/optimize.c
+++ b/optimize.c
@@ -637,7 +637,7 @@ iseq_analyze(rb_iseq_t *iseq)
     }
 }

-static const VALUE wipeout_pattern[8]; /* maybe 5+2==7 should suffice? */
+static VALUE wipeout_pattern[8]; /* maybe 5+2==7 should suffice? */
 static VALUE adjuststack;
 static VALUE nop;
 static VALUE putobject;

@noahgibbs
Contributor

Is this moving forward? Is there anything we can do to help?

@jeremy

jeremy commented Sep 27, 2016

Is there anything we can do to help?

Some suggestions:

  1. Try to run it with your own app or other software.
  2. Try to break it, causing a crash or incorrect behavior.
  3. Benchmark it.
  4. Review and critique the idea and implementation.

@jamescook

I'm seeing errors when running rake's test suite:
rake_tests_output.txt

and errors when running bundle for rails:
rails_bundle_output.txt

@noahgibbs
Contributor

I get a different error on "make". I'm building with Clang on Mac.

deopt_make_errors.txt

I thought a cfp was created per method call and never reused, but that
turned out to be a wrong idea, because sometimes (as in the
vm_call_opt_send situation) vm_push_frame is not called.  We have to
explicitly check that the cfp is not clobbered.

	* vm_core.h (cfp_last_insn): new field, to hold iseq's encoded
	  body.

	* vm_insnhelper.h (PREPARE_FOR_ELIMINATION): store the encoded
	  body into last insn info.

	* vm_insnhelper.c (vm_eliminate_insn): check if the iseq in
	  question is the one stored earlier.

A cache entry can become stale on occasion.  That's natural, but the
problem is that the cache contents might include pointer(s) to
already-reclaimed memory regions.  By touching them we face SEGVs, so
we have to carefully avoid such situations.

	* optimize.c (cc_is_stale): check if the cache is fresh.

	* optimize.c (purity_of_cc): ditto.

	* vm.c (ruby_vm_global_method_state): disclose.

This prevents over-optimization.  A minimal example reproducing the
situation is:

```
% ruby --dump=insns -e '(true ? 0 : 1) + 1'
== disasm: #<ISeq:<main>@-e>============================================
0000 trace            1                                               (   1)
0002 putobject_OP_INT2FIX_O_0_C_
0003 jump             6
0005 putobject_OP_INT2FIX_O_1_C_
0006 putobject_OP_INT2FIX_O_1_C_
0007 opt_plus         <callinfo!mid:+, argc:1, ARGS_SIMPLE>, <callcache>
0010 leave
```

Here, the `jump 6` jumps into the middle of a putobject - putobject -
send sequence.  But because that sequence is subject to constant
folding, once it is replaced with `putobject 2`, address number 6 is
the literal 2.  So it breaks.

In order to prevent this, we now introduce a "purposeful" nop right
after a branch, and now it looks like:

```
% ./miniruby --dump=insns -e '(true ? 0 : 1) + 1'
== disasm: #<ISeq:<main>@-e>============================================
0000 trace            1                                               (   1)
0002 putobject_OP_INT2FIX_O_0_C_
0003 jump             6
0005 putobject_OP_INT2FIX_O_1_C_
0006 nopp
0007 putobject_OP_INT2FIX_O_1_C_
0008 opt_plus         <callinfo!mid:+, argc:1, ARGS_SIMPLE>, <callcache>
0011 leave
```

	* insns.def (nopp): new instruction.

	* compile.c (iseq_compile_each): insert the new nopp instruction
	  right after a NODE_IF compilation finishes, to prevent the
	  constant-folding situation.

	* optimize.c (purity_of_insn): take care of the new instruction.
@shyouhei
Member Author

OK, after close inspection I found several bugs and fixed them all. It now bundles and runs Rails 5.0.0.1, at least for me.

@NullVoxPopuli

How come the AppVeyor test results aren't public?

@noahgibbs
Contributor

Still doesn't compile on Mac with Clang. I have changes locally that allow it to build, but it segfaults. I've attached the "git diff". I'll try to get it to build and run successfully.
deopt_mac_patch.txt

@shyouhei
Member Author

Hmm. Thank you. I've not tested this against clang so far. Will take a look.

Before this, we prepared every iseq.  Koichi Sasada pointed out this
could be a problem when there are lots of non-optimizable sequences.
In order to reroute this, we delay allocation of deoptimized sequences
until we actually touch the sequence.

	* iseq.c (rb_iseq_new_with_opt): delay allocating deoptimized
	  sequence.

	* optimize.c (iseq_squash): prepare just in case this is the first
	  time we optimize this target.

	* optimize.c (iseq_eager_optimize): ditto.

	* deoptimize.h (iseq_deoptimize_if_needed): consideration of
	  virgin iseqs.
@rye

rye commented Dec 8, 2016

Here is the Make output when I compile using clang.

EDIT: Linked to a gist rather than dumping all the output in here.

@claudiug

Any status on this one? It's almost one year old :)

@mensfeld
Contributor

Interested in this one as well. Any news?

@nijikon

nijikon commented Jun 26, 2017

Me also.

@marysieek

@pedroaliens @shyouhei any update on this issue?

@benlovell

Folks, piling on doesn't help expedite an issue.

@shyouhei
Member Author

Hi, a quick update: it's still work in progress. I got comments from other core devs including @ko1, so I think I need his review to merge this once I finish a peaceful update.

Thank you for your interest!

@NullVoxPopuli

NullVoxPopuli commented Aug 4, 2017

I successfully made and installed this, but when I tried running bundle install, I got a segfault:

$ sudo bundle install # happens with and without sudo, though
Don't run Bundler as root. Bundler can ask for sudo if it is needed, and installing your
bundle as root will break this application for all non-root users on this machine.
Fetching gem metadata from https://rubygems.org/./usr/local/lib/ruby/2.4.0/net/http.rb:939: [BUG] Segmentation fault at 0x00000000000010
ruby 2.4.0dev (2016-07-22 deoptimization 55727) [x86_64-linux]

-- Control frame information -----------------------------------------------
c:0023 p:---- s:0138 e:000137 CFUNC  :wait_readable
c:0022 p:0701 s:0134 E:002290 METHOD /usr/local/lib/ruby/2.4.0/net/http.rb:939
c:0021 p:0008 s:0121 e:000120 METHOD /usr/local/lib/ruby/2.4.0/net/http.rb:869
c:0020 p:0076 s:0118 e:000117 METHOD /usr/local/lib/ruby/2.4.0/net/http.rb:864
c:0019 p:0045 s:0115 e:000114 METHOD /usr/local/lib/ruby/gems/2.4.0/gems/bundler-1.15.3/lib/bundler/vendor/net-http-persistent/lib/net/http/persistent.rb:702
c:0018 p:0530 s:0110 E:001a50 METHOD /usr/local/lib/ruby/gems/2.4.0/gems/bundler-1.15.3/lib/bundler/vendor/net-http-persistent/lib/net/http/persistent.rb:633
c:0017 p:0041 s:0097 e:000096 METHOD /usr/local/lib/ruby/gems/2.4.0/gems/bundler-1.15.3/lib/bundler/vendor/net-http-persistent/lib/net/http/persistent.rb:996
c:0016 p:0148 s:0084 e:000083 METHOD /usr/local/lib/ruby/gems/2.4.0/gems/bundler-1.15.3/lib/bundler/fetcher/downloader.rb:50
c:0015 p:0048 s:0075 e:000074 METHOD /usr/local/lib/ruby/gems/2.4.0/gems/bundler-1.15.3/lib/bundler/fetcher/downloader.rb:16
c:0014 p:0028 s:0067 e:000066 METHOD /usr/local/lib/ruby/gems/2.4.0/gems/bundler-1.15.3/lib/bundler/fetcher/compact_index.rb:116
c:0013 p:0161 s:0061 e:000060 BLOCK  /usr/local/lib/ruby/gems/2.4.0/gems/bundler-1.15.3/lib/bundler/compact_index_client/updater.rb:50
c:0012 p:0048 s:0054 e:000053 METHOD /usr/local/lib/ruby/2.4.0/tmpdir.rb:89
c:0011 p:0026 s:0047 e:000046 METHOD /usr/local/lib/ruby/gems/2.4.0/gems/bundler-1.15.3/lib/bundler/compact_index_client/updater.rb:30
c:0010 p:0068 s:0040 e:000039 METHOD /usr/local/lib/ruby/gems/2.4.0/gems/bundler-1.15.3/lib/bundler/compact_index_client.rb:81
c:0009 p:0151 s:0035 e:000034 METHOD /usr/local/lib/ruby/gems/2.4.0/gems/bundler-1.15.3/lib/bundler/compact_index_client.rb:97

@claudiug

Hello, now this PR is here for 2 years. Any news?

@shyouhei
Member Author

This pull request was a bit too big for the other core devs to review, so I changed my strategy for pushing things forward. As you see above, several parts of this pull request were separated out and proposed individually. Some of them have already been merged, gaining ~1% speedup.

Right now I am preparing another single-cut branch which brings some more boost. It is yet to be pull-requested because it fails some tests right now. I hope it can be submitted soon.

@shyouhei shyouhei mentioned this pull request Mar 19, 2019
@k0kubun k0kubun changed the base branch from trunk to master August 15, 2019 17:37
@shyouhei
Member Author

shyouhei commented Sep 9, 2019

As I wrote above, we are not going to merge this pull request as-is. So I am closing it now.
