
Weird stuff with hashes

Ok, so this isn’t really weird stuff, I don’t think. But maybe people don’t know about it so I thought I would post.

Hashes dup string keys

When you assign a string key to a hash, the hash copies the string and freezes it. Here is an example:

>> x = 'string'
=> "string"
>> y = {}
=> {}
>> y[x] = :value
=> :value
>> { x => y.keys.first }
=> {"string"=>"string"}
>> { x.object_id => y.keys.first.object_id }
=> {70157126548460=>70157126526760} # object ids are different
>> { x.frozen? => y.keys.first.frozen? }
=> {false=>true} # frozen value is different

You can prevent the string from being duped by freezing the string before putting it in the hash:

>> x = 'string'
=> "string"
>> x.freeze
=> "string"
>> y = {}
=> {}
>> y[x] = :value
=> :value
>> { x.object_id => y.keys.first.object_id }
=> {70157126414020=>70157126414020} # note that both object ids are the same

Why is this important?

I’ve been working with @eileencodes to improve the performance of integration tests in Rails. It turns out that the tests are spending a significant amount of time in the garbage collector. We started tracking down allocations, and found that setting values in the header hash was creating many strings.

To partially fix this, we added a freeze to the string before inserting it into the header hash. This cut the allocations down: the code used to allocate one string for the “downcased” version, then another when the downcased version was inserted into the hash.
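The idea looks roughly like this (a made-up sketch, not the actual Rails code):

# A made-up sketch of the idea, not the actual Rails code.
class Headers
  def initialize
    @headers = {}
  end

  # Before: `key.downcase` allocates one string, then the Hash dups
  # (and freezes) it again when it becomes a key.
  def set_slow(key, value)
    @headers[key.downcase] = value
  end

  # After: freezing the downcased string first lets the Hash use it
  # directly, so only the downcase allocation remains.
  def set_fast(key, value)
    @headers[key.downcase.freeze] = value
  end
end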

Setting strings as hash keys may be a source of unnecessary object allocation in your application. Now, don’t go freezing all the strings. Make sure to measure where object allocations are happening in your app first (a subject that I’ll cover in a later blog post), then freeze strings where appropriate.
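If you just want a rough idea of where allocations are coming from, one cheap trick is to compare GC.stat counters around a block of code (gems like allocation_tracer and memory_profiler will give you a much more detailed breakdown). A sketch:

# Rough way to count how many objects a block allocates. Gems like
# allocation_tracer or memory_profiler give a much more detailed,
# per-line breakdown.
def allocations
  GC.disable
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
ensure
  GC.enable
end

headers = {}
key_a = "x-custom-a-#{rand(100)}"        # built at runtime, not frozen
key_b = "x-custom-b-#{rand(100)}".freeze # built at runtime, then frozen

p allocations { headers[key_a] = :value } # includes the Hash's dup of the key
p allocations { headers[key_b] = :value } # frozen key is used as-is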

Why does the Hash dup strings?

I think there are a few reasons.

One, if you mutate the string after adding it to the hash, it would be surprising (to me anyway) if the key for the hash was mutated too. For example:

>> x = 'string'
=> "string"
>> y = {}
=> {}
>> y[x] = :value
=> :value
>> x.sub!(/r/, '')
=> "sting"
>> y.key? x
=> false
>> { y.keys => x }
=> {["string"]=>"sting"}

I think it would be pretty surprising if, after mutating the object in place, it were still part of the hash.

I think the other, more technical reason is that mutating the string changes the hash value of the string, which means your hash would need to be rehashed in order to find the same key:

>> x.hash
=> 4078764570319932244
>> x.sub!(/r/, '')
=> "sting"
>> x.hash
=> -1518905602476999392

Note that the hash value changed.

Hash value, what does that have to do with anything?

If you change the hash value of a hash key, then the key will be in the wrong place in the hash. I think this might be more clear with a code example.

You may know about the two methods you need to implement when creating a custom hash key: hash determines where in the hash the object will be stored, and eql? is used when there is a hash collision.
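Just for reference, a custom key might define both methods like this (a made-up Point class):

# A made-up Point class: two Points with the same coordinates should
# count as the same key, so both hash and eql? are defined in terms
# of the coordinates.
class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end

  def hash
    [x, y].hash # picks the bucket
  end

  def eql?(other)
    other.is_a?(Point) && x == other.x && y == other.y # breaks ties on collision
  end
end

h = { Point.new(1, 2) => :found }
p h[Point.new(1, 2)] # => :found, even though it's a different object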

Let’s make an object that allows us to mutate its hash value, and see how it behaves after inserting it into a hash:

class Foo
  attr_accessor :hash
end

x = Foo.new
x.hash = 10

hash = {}
hash[x] = :hello

p hash.key?(x)          # => true
p hash.keys.include?(x) # => true

x.hash = 11 # change the hash value

p hash.key?(x)          # => false
p hash.keys.include?(x) # => true

hash.rehash

p hash.key?(x)          # => true
p hash.keys.include?(x) # => true

First we see that x is a key to the hash, and is in the list of keys. After we mutate the hash value, Hash#key? returns false, yet Array#include? returns true when we check for the object in the list of keys.

The reason Hash#key? returns false is that it uses the value of hash to find the object in the hash buckets. When we first inserted the object into the hash, it went into the “10” bucket. But now that we’ve changed the value to “11”, the hash looks for the object in the “11” bucket, which is the wrong place!

To fix this, we call the rehash method on the hash. The rehash method will redistribute the keys. Since the object in the “10” bucket is now supposed to be in the “11” bucket, it will move the object to the right place, and Hash#key? will work correctly.

Wouldn’t it be annoying if you had to do the same song and dance as this if you mutated your strings? I think this is the main reason why hashes dup and freeze string keys.

What is the lesson?

Don’t mutate the hash value on hash keys! You’ll have a bad time.

Anyway, I don’t think this is weird or anything, it’s just something that we don’t have to deal with on a day to day basis. Maybe I should have titled the post “Uncommon things to do with hashes”.

Thanks for reading!!! <3<3<3<3<3

Performance P.S. (is that a P.P.S.?)

An interesting thing is that Hash#dup will rehash the newly created hash, whereas Hash[] will not:

class Foo
  attr_accessor :hash
end

x = Foo.new
x.hash = 10

hash = {}
hash[x] = :hello

p hash.key?(x)          # => true

x.hash = 11 # change the hash value

p hash.key?(x)          # => false
p hash.dup.key?(x)      # => true
p Hash[hash].key?(x)    # => false

This means that duping a hash with Hash[] can end up being faster than using Hash#dup. Here is a benchmark to demonstrate:

require 'benchmark/ips'

hash = Hash[*('a'..'z').to_a]
Benchmark.ips do |x|
  x.report("Hash#dup") do
    hash.dup
  end

  x.report("Hash[]") do
    Hash[hash]
  end
end

Results on my machine:

Calculating -------------------------------------
            Hash#dup     7.705k i/100ms
              Hash[]    15.978k i/100ms
-------------------------------------------------
            Hash#dup     93.648k (± 4.9%) i/s -    470.005k
              Hash[]    230.497k (±11.2%) i/s -      1.150M

Does this mean that you should switch to Hash[]? Only if your benchmarks can prove that it’s a bottleneck. Please please please don’t change all of your code because this shows it’s faster. Make sure to measure your app performance first. But you already knew that, which is why you’re reading the PS section. ;-)
