Diving into How Hashes Work in Ruby

Ruby’s Hash is arguably the most useful data type in the language. Utilizing a structure that lends itself well to modelling real-world problems, combined with speedy lookups, it is frequently the backbone of scripts, web pages, and complex business logic. It was even once used internally by Ruby to store all your functions, variables, and constants (although modern Ruby approaches this differently by using branch prediction of hot paths and call-site method caching; that’s another post!) We often take it for granted that it “just works”. Today we’re going to delve into the internals of Hash, seeing exactly why it works, how it is so efficient, and how it is implemented. Onward!

Implementation

big_four_stock_prices = {
    :google => 801.23,
    :apple => 115.57,
    :facebook => 128.35,
    :microsoft => 57.19
}

What’s going on here? Let’s walk through it.

First, the runtime allocates a contiguous block of memory for the data structure. Next it hashes the first key:

:google.hash
=> 3592

The hash function is declared in Ruby’s kernel class, so it’s available to any class with that in its inheritance chain. The actual number will vary depending on a number of factors, but what is important is that it remains consistent across a session.

Next, we have to store the value. The value is placed in what is called a ‘bucket’ (or a ‘bin’ if you’re coming from other languages). A Ruby hash will allocate 11 buckets initially, so we need to decide in which bucket to store our hash value. For that, the runtime simply performs the following calculation:

 :google.hash % 11
 => 5

This means that the value for :google, which is 801.23, is placed in the 5th bucket of the hash table, with an identifier for its key that is something like:

[:google, 801.23]

It repeats this process for the rest of the key/value pairs. To pull it back out, the interpreter simply repeats the calculation and pulls out the value from the bucket. Simple right? Let’s get into the edge cases.

Collisions

The more astute readers among you may have caught on to the fact that if there are only 11 possible buckets, sooner or later the calculation is going to place two values in the same bucket. What happens then? Well, Ruby’s hash implementation will store the items in the same bin in the form of a linked list. If the identifier (the key) matches the one stored (it checks this with equal?), it returns the value, and if it doesn’t, it simply continues traversing the list.

Of course, this process could be time consuming – what if an item shared a bucket with 1000 others? It could theoretically take a large amount of time to return the desired value, and the whole point of using a hash table is that it is fast. Ruby, of course, has you covered. It does this by keeping a running total of the average number of items in each bin across the whole table. Once this value exceeds 5, Ruby discards the array of 11 bins, creates a new one and it is at this point where Ruby will rehash the values. By doing this, the linked lists in bins are kept short, which decreases overall lookup time.

Lookup time is the key selling point for a hash. For the Big O Notation-inclined amongst you, a hash table will look up values in constant time; that is that as the number of values in the table increases, the lookup time remains the same. In practice, it’s closer to O (log n), but let’s not get too pedantic!

The Importance of a Good Hashing Algorithm

To cement our understanding of the hash implementation, let’s try a brief experiment: What would happen if a custom hashing function returned the same integer each time? Our mental model of the hash suggests that this will append each item to a linked list in the same bin, and the time to look up an item towards the end of that list will be much higher. Let’s try with some code:

require 'benchmark'

class GoodHash
end

class BadHash
  def hash
    7
  end
end

hashing_algorithm == BadHash  # switch to GoodHash for inverse

17.times do |ex|
  lookup_key = nil
  size = 2**ex
  test_hash = {}
  (1..size).each do |n|
    index = hashing_algorithm.new
    lookup_key = index if n > size/2 and lookup_key.nil?
    test_hash[index] = rand
  end

  Benchmark.bm do |bench|
    bench.report("Hash size: #{size} elements 10000 times\n") do
      10000.times do
        test_hash[lookup_key]
      end
    end
  end
end

This will generate a hash (of varying sizes) similar to this:

test_hash = {
    #<BadHash:0x007f92941eaae0> => 0.029655456018705673,
    #<BadHash:0x007f92941e3c68> => 0.973576967117432,
    #<BadHash:0x007f92941e1508> => 0.2443654110192186,
    #<BadHash:0x007f92941d1ab8> => 0.7536013342129247
}

Let’s walk through the code:

First we create a class, GoodHash, which uses Ruby’s native hashing algorithm. Then we loop 17 times (65536 items seems a good place to stop!), exponentiating a base of 2 to the power (ex). We then insert a random value into test_hash using an instance of GoodHash as the key. We look up on that value 1000 times and benchmark the results, before repeating the process with BadHash

Results (readings taken from ‘real’ column):

Hash size	GoodHash	BadHash
1	0.002808	0.003077
2	0.002841	0.003663
4	0.003018	0.004167
8	0.005757	0.004750
16	0.003502	0.006898
32	0.003174	0.011138
64	0.002843	0.019534
128	0.003080	0.041806
256	0.003159	0.086643
512	0.003137	0.147724
1024	0.003179	0.287432
2048	0.002843	0.554049
4096	0.002828	1.101022
8192	0.005380	2.192948
16384	0.003037	4.385111
32768	0.002830	8.739381
65536	0.002830	17.314654

This proves our hypothesis fairly conclusively. As the number of items in the hash grows, the lookup time with the BadHash class grows exponentially, while the GoodHash implementation remains firmly grouped around the same sort of time. The outliers in the GoodHash column also illustrate Ruby rehashing once the buckets get too full.

Interestingly, Ruby 2.0 onwards doesn’t work like this for smaller hashes. If your hash is 6 items or fewer, it is stored as an array list and traversed until the key matches the lookup. While hash tables are quick, by far the biggest cost is calculating the hash value; a process neatly sidestepped with this approach. In the event that the hash grows to 7 items, the array is discarded and the traditional hash is used. This is illustrated in the results table as with the entry at a hash size of 8.

How Does Ruby Allocate Bins?

Since the number of bins don’t increase linearly with the number of items in them, how does Ruby decide how many bins a hash should have? Well, Ruby calculates a list of prime numbers close to powers of 2. The benefit of this is that the number when combined with the hash value is much less likely to share a common factor. If this wasn’t true, then the modulus calculation we covered earlier could yield the same number each time, which would put everything into the same bin, not unlike our experiment prior.

Conclusion

For a dive into the internals of Ruby, this was relatively straightforward. While we’ve covered hashes pretty extensively, it is important to know when and where to use them. By far the largest costs in hashing are the rehashing, hash calculation, and bucketing, so be mindful of these when performance is a factor (the JRuby implementation avoids relying on hashes internally for this exact reason). Generally however, a hash is an excellent tool to reach for, given its intelligent design and careful optimizations. We now know exactly how a hash works, why it is so performant, and never to override the hash function for an object!

Frequently Asked Questions about Hashes in Ruby

What is the difference between a hash and an array in Ruby?

In Ruby, both hashes and arrays are used to store data. However, the key difference lies in how they store and retrieve this data. An array is an ordered list of elements that are accessed by their index, which is a numerical value. On the other hand, a hash, also known as a dictionary, is a collection of key-value pairs. The keys in a hash can be any object, and they are used to retrieve the corresponding values. This makes hashes more flexible and powerful than arrays for many tasks.

How can I create a new hash in Ruby?

Creating a new hash in Ruby is straightforward. You can use the Hash.new method or the literal notation with curly braces {}. Here’s an example of both:

# Using Hash.new
hash = Hash.new
hash["key"] = "value"

# Using literal notation
hash = {"key" => "value"}

In both cases, a new hash is created with a single key-value pair.

How can I access the values in a hash?

You can access the values in a hash by using their corresponding keys. Here’s an example:

hash = {"key" => "value"}
puts hash["key"] # Outputs: value

Can a hash have duplicate keys?

No, a hash cannot have duplicate keys. If you try to insert a key-value pair with a key that already exists in the hash, the new value will replace the old one. This is because keys in a hash are unique.

How can I delete a key-value pair from a hash?

You can delete a key-value pair from a hash using the delete method. Here’s an example:

hash = {"key" => "value"}
hash.delete("key") # Removes the key-value pair {"key" => "value"}

How can I iterate over a hash?

You can iterate over a hash using the each method, which passes each key-value pair to a block of code. Here’s an example:

hash = {"key1" => "value1", "key2" => "value2"}
hash.each do |key, value|
puts "#{key}: #{value}"
end

Can a hash key or value be any type of object?

Yes, in Ruby, both the keys and values in a hash can be any type of object. This includes strings, numbers, arrays, other hashes, and even custom objects.

How can I check if a hash contains a specific key or value?

You can check if a hash contains a specific key using the has_key? method, and if it contains a specific value using the has_value? method. Here’s an example:

hash = {"key" => "value"}
puts hash.has_key?("key") # Outputs: true
puts hash.has_value?("value") # Outputs: true

How can I merge two hashes?

You can merge two hashes using the merge method. This method returns a new hash that combines the contents of the original hash and the hash passed as an argument. If both hashes have the same key, the value from the second hash will be used. Here’s an example:

hash1 = {"key1" => "value1"}
hash2 = {"key2" => "value2"}
merged_hash = hash1.merge(hash2) # {"key1" => "value1", "key2" => "value2"}

How can I convert a hash to an array and vice versa?

You can convert a hash to an array using the to_a method, which returns an array of arrays, each containing a key-value pair. To convert an array back to a hash, you can use the to_h method. Here’s an example:

hash = {"key" => "value"}
array = hash.to_a # [["key", "value"]]
hash = array.to_h # {"key" => "value"}