Speed ISeq marking by using a bitmap and rearranging inline caches #6053
Conversation
Thanks for the great work! I love your contributions 😄. Unfortunately the only part of the PR that I can understand are the comments. 😳 (I'm not experienced in C at all)
    while (bits) {
        if (bits & 0x1) {
Wouldn't it be faster to use ffs here? https://man7.org/linux/man-pages/man3/ffs.3.html Assuming the bitmap is relatively sparse, you'd save quite a few iterations. Or am I missing something?
Ya, it should be pretty sparse. I'll try it with ffs and see what the numbers look like.
I tried using ffs, but it's turning out to be a pain because you can't right shift by 64. So we'd need to add a special case if just the top bit is set. I think we should investigate using ffs, but I want to merge this as-is for now.
Hum, not sure I follow what the problem is. If only the first bit is set, ffs (or ffl more likely) return 64, but the shift should always be bits >>= ffs(bits) - 1 (unless ffs returns 0).
Anyway, no big deal. I'd like to try backporting this patch on top of 3.1 though.
This post about iterating over set bits quickly could be useful here; it uses __builtin_ctzl.
> Hum, not sure I follow what the problem is. If only the first bit is set, ffs (or ffl more likely) return 64, but the shift should always be bits >>= ffs(bits) - 1 (unless ffs returns 0).
ffs returns the nth bit and starts at a 1 index, so ffs(0x1) => 1. If we did bits >>= ffs(bits) - 1, then it would never shift anything in the case of 0x1, so we have to do bits >>= ffs(bits). ffs of 0x8000000000000000 returns 64, so the right shift doesn't work.
We can probably use __builtin_ctzl, but I think we should see if this iteration is really a bottleneck at the moment.
This commit adds a bitfield to the iseq body that stores offsets inside the iseq buffer that contain values we need to mark. We can use this bitfield to mark objects instead of disassembling the instructions.

This commit also groups inline storage entries and adds a counter for each entry. This allows us to iterate and mark each entry without disassembling instructions.

Since we have a bitfield and grouped inline caches, we can mark all VALUE objects associated with instructions without actually disassembling the instructions at mark time.

[Feature #18875] [ruby-core:109042]
Co-authored-by: Tomás Coêlho <36938811+tomascco@users.noreply.github.com>
A large percentage of major GC time is spent marking instruction sequence (ISeq) objects. This PR aims to speed up major GC by making ISeq marking faster.
Marking ISeq objects
Today we have to disassemble instruction sequences in order to mark them. The disassembly process looks for GC allocated objects and marks them. To disassemble an iseq, we have to iterate over each instruction, convert the instruction from an address back to the original opcode integer, then look up the parameters for the opcode. Once we know the parameter types, we can iterate through them and mark "interesting" references. We can see this process in the iseq_extract_values function. According to profile results, the biggest bottleneck in this function is converting addresses back to instruction ids.
Speeding up ISeq marking
To speed up ISeq marking, this PR introduces two changes. The first change is adding a bitmap, and the second change is rearranging inline caches to be more "convenient".
Bitmaps
At compilation time, we allocate a bitmap alongside the iseq object. The bitmap indicates the offsets of VALUE objects inside the instruction sequences. When marking an instruction sequence, we can simply iterate over the bitmap to find the VALUE objects that need to be marked.
Inline Cache Rearrangement
Inline cache types IC, IVC, ICVARC, and ISE are allocated from a buffer that is stored on the iseq constant body. These caches are a union type. Unfortunately, these union types don't have a "type" field, so they can only be distinguished by looking at the parameter types of an instruction.
Take the following Ruby code for example:
The instruction sequences for this code are as follows:
The ISeq object contains two entries in the is_entries buffer: one for the ISE cache associated with the once instruction, and one for the IC cache associated with the opt_getinlinecache and opt_setinlinecache instructions.

Unfortunately, we cannot iterate through the caches in the is_entries list because the union types don't have the same layout. Marking an ISE is very different from marking an IC, and we can only differentiate them by disassembling and checking the instruction sequences themselves.

To solve this problem, this PR introduces 3 counters for the different types of inline caches. Then we group the inline cache types within the is_entries buffer. Since the inline cache types are grouped, we can use the counters to iterate over the buffer and know which type is being used.
Combining bitmap marking and inline cache arrangement means that we can mark instruction sequences without disassembling the instructions.
Speed impact
I benchmarked this change with a basic Rails application using the following script:
Here are the results with the master version of Ruby:
Here it is with these patches applied:
With these patches applied, major GC is about 60% faster.
Memory impact
The memory increase is proportional to the number of instructions stored on an iseq. This works out to be about a 1% increase in the size of an iseq (ceil(iseq_length / 64) on 64 bit platforms).
Future work
This PR always mallocs a bitmap table. We can eliminate the malloc when:
We may also want to consider using a succ_index_table for storing the bitmap.
The Redmine issue is here.