Speed ISeq marking by using a bitmap and rearranging inline caches #6053
Conversation
Thanks for the great work! I love your contributions 😄. Unfortunately the only part of the PR that I can understand are the comments. 😳 (I'm not experienced in C at all)
    while (bits) {
        if (bits & 0x1) {
Wouldn't it be faster to use ffs here? https://man7.org/linux/man-pages/man3/ffs.3.html Assuming the bitmap is relatively sparse, you'd save quite a few iterations. Or am I missing something?
Ya, it should be pretty sparse. I'll try it with ffs and see what the numbers look like.
I tried using ffs, but it's turning out to be a pain because you can't right shift by 64. So we'd need to add a special case if just the top bit is set. I think we should investigate using ffs, but I want to merge this as-is for now.
Hum, not sure I follow what the problem is. If only the first bit is set, ffs (or ffl more likely) return 64, but the shift should always be bits >>= ffs(bits) - 1 (unless ffs returns 0).
Anyway, no big deal. I'd like to try backporting this patch on top of 3.1 though.
This post about iterating over set bits quickly could be useful here; it uses __builtin_ctzl.
> Hum, not sure I follow what the problem is. If only the first bit is set, ffs (or ffl more likely) return 64, but the shift should always be bits >>= ffs(bits) - 1 (unless ffs returns 0).
ffs returns the nth bit and starts at a 1 index, so ffs(0x1) => 1. If we did bits >>= ffs(bits) - 1, then it would never shift anything in the case of 0x1, so we have to do bits >>= ffs(bits). ffs of 0x8000000000000000 returns 64, so the right shift doesn't work.
We can probably use __builtin_ctzl, but I think we should see if this iteration is really a bottleneck at the moment.
This commit adds a bitfield to the iseq body that stores offsets inside the iseq buffer that contain values we need to mark. We can use this bitfield to mark objects instead of disassembling the instructions.

This commit also groups inline storage entries and adds a counter for each entry. This allows us to iterate and mark each entry without disassembling instructions.

Since we have a bitfield and grouped inline caches, we can mark all VALUE objects associated with instructions without actually disassembling the instructions at mark time.

[Feature #18875] [ruby-core:109042]
Co-authored-by: Tomás Coêlho <36938811+tomascco@users.noreply.github.com>
A large percentage of major GC time is spent marking instruction sequence (ISeq) objects. This PR aims to speed up major GC by making ISeq marking faster.
Marking ISeq objects
Today we have to disassemble instruction sequences in order to mark them. The disassembly process looks for GC allocated objects and marks them. To disassemble an iseq, we have to iterate over each instruction, convert the instruction from an address back to the original opcode integer, then look up the parameters for the opcode. Once we know the parameter types, we can iterate through them and mark "interesting" references. We can see this process in the iseq_extract_values function. According to profile results, the biggest bottleneck in this function is converting addresses back to instruction ids.
Speeding up ISeq marking
To speed up ISeq marking, this PR introduces two changes. The first change is adding a bitmap, and the second change is rearranging inline caches to be more "convenient".
Bitmaps
At compilation time, we allocate a bitmap alongside the iseq object. The bitmap indicates the offsets of VALUE objects inside the instruction sequences. When marking an instruction sequence, we can simply iterate over the bitmap to find the VALUE objects that need to be marked.
Inline Cache Rearrangement
Inline cache types IC, IVC, ICVARC, and ISE are allocated from a buffer that is stored on the iseq constant body. These caches are a union type. Unfortunately, these union types don't have a "type" field, so they can only be distinguished by looking at the parameter types of an instruction.
Take the following Ruby code for example:
The instruction sequences for this code are as follows:
The ISeq object contains two entries in the is_entries buffer: one for the ISE cache associated with the once instruction, and one for the IC cache associated with the opt_getinlinecache and opt_setinlinecache instructions.

Unfortunately, we cannot iterate through the caches in the is_entries list because the union types don't have the same layout. Marking an ISE is very different from marking an IC, and we can only differentiate them by disassembling and checking the instruction sequences themselves.

To solve this problem, this PR introduces 3 counters for the different types of inline caches. Then we group the inline cache types within the is_entries buffer. Since the inline cache types are grouped, we can use the counters to iterate over the buffer and know which type is being used.
Combining bitmap marking and inline cache arrangement means that we can mark instruction sequences without disassembling the instructions.
Speed impact
I benchmarked this change with a basic Rails application using the following script:
Here are the results with the master version of Ruby:
Here it is with these patches applied:
With these patches applied, major GC is about 60% faster.
Memory impact
The memory increase is proportional to the number of instructions stored on an iseq. This works out to be about a 1% increase in the size of an iseq (ceil(iseq_length / 64) on 64 bit platforms).
Future work
This PR always mallocs a bitmap table. We can eliminate the malloc when:
We may also want to consider using a succ_index_table for storing the bitmap.
The Redmine issue is here.