How to Remove a Column with Zero Downtime in Ruby on Rails

For a production Ruby on Rails application, uptime is paramount. Altering the structure of an application’s persistence layer is an operation that competes directly with uptime. Specifically, removing a table column within a relational database causes issues with the ActiveRecord ORM (the default relational mapping within a Ruby on Rails application).

However, this particular pain point has been removed as of Ruby on Rails 4.0.0, saving developers a lot of headache and greatly reducing the need for structural change coordination.

Old and Busted

To demonstrate the problem, a simple Ruby on Rails 3.2 application is created with a User model:

class User < ActiveRecord::Base
end

Supporting the User model is a PostgreSQL database table created with a migration:

class CreateUsers < ActiveRecord::Migration
  def change
    create_table :users do |t|
      t.string :first_name
      t.string :last_name
      t.string :email
      t.timestamps null: false
    end
  end
end

To ensure this application has as close to 100% uptime as possible, it can be assumed that it is behind some kind of pre-loader. As in, when code is deployed, existing requests are given time to complete before new requests are shepherded over to the new version of the code.

In a Ruby on Rails 3.2 application, a problem will arise when a column is removed from the database and the ORM does not have time to restart. In this case, even the pre-loader will not save an application from throwing an error about the missing column.

To emulate this problem, a rails console is run in the production environment and a User is created:

RACK_ENV=production rails console
> User.create(first_name: 'test', last_name: 'test')
# => #<User id: 1, first_name: "test", last_name: "test", email nil, created_at: "2016-02-07 21:03:26", updated_at: "2016-02-07 21:03:26">

In parallel, the psql command line client is used to connect to the same production database used by this server, all still running locally.

Within this psql prompt, the email column from the users table will be removed while the production console is still running:

psql app_production
psql=# ALTER TABLE users DROP COLUMN email;
ALTER TABLE

Jumping back to the RACK_ENV=production rails console, another User creation attempt results in an error:

> User.create(first_name: 'test2', last_name: 'test2')
# => PG::UndefinedColumn: ERROR:  column "email" of relation "users" does not exist

Kaboom! Rails attempts to save a new record to the users table and populate the email column even though it was never specified to do so. Seeing as this is not new behaviour, surely someone has come up with a solution to this problem that does not require downtime.

Yesterday’s Jam

As it turns out, a few different blog posts have been written about the solution to this very problem.

The TL;DR of these posts is that a multiple phase solution is required under the given circumstances:

1. Write code that `rejects` the column to remove
2. Deploy
3. Write migration to remove the column
4. Deploy + Migrate
5. Write code to remove the column `rejection`
6. Deploy

Phew! Three deploys and presumably three code reviews later, the column is finally removed. Assuming a system is built with continuous deployment guarded by a continuous integration suite, this whole process could take the better part of a work day in order to accomplish such a simple task.

Even worse, if an application has scheduled weekly deploys where only one deployment can happen per week, then this column will linger for three whole weeks. That is just absurd, what kind of barbarians can live this way?

New Hotness

Luckily, the incredible people who maintain and create features for the Ruby on Rails framework saw this behaviour and decided to correct it. As of a very influential commit during the beta release of Ruby on Rails 4.0.0, the framework does not aggressively attempt to populate attributes not specified during creation or update.

To prove this, a new Ruby on Rails 4.0.0 application is made with the exact same structure as above.

Then, the same production console is created along with a User:

RACK_ENV=production rails console
> User.create(first_name: 'test', last_name: 'test')
# => #<User id: 1, first_name: "test", last_name: "test", email nil, created_at: "2016-02-07 22:09:49", updated_at: "2016-02-07 22:09:49">

Again, in parallel, a psql client is initialized for the production environment database.

Within this prompt, the email column on the users table is removed:

psql app_production
psql=# ALTER TABLE users DROP COLUMN email;
ALTER TABLE

And finally, a new user is created in the still running RACK_ENV=production rails console:

> User.create(first_name: 'test2', last_name: 'test2')
# => #<User id: 2, first_name: "test2", last_name: "test2", email nil, created_at: "2016-02-07 22:13:18", updated_at: "2016-02-07 22:13:18">

Unlike before, the second insertion into the users table worked wonderfully, despite the email column’s removal.

As shown, the in memory representation of the User class still has the email attribute. The User object’s representation has not been reloaded, it simply just does not care about the email column when writing to the database.

The fact that the ORM does not specifically reload when writing to a table points out an interesting detail. It means that for this solution to fully work, the email column must not be referenced at all.

With this new found behaviour, the previous six step incantation comprised of three separate deployments can be reduced to three steps in total and just one deployment:

1. Remove references to column and write migration
2. Deploy
3. Once requests completely use new code, migrate

Wunderbar! The reject and deploy pattern is no longer necessary. Further more, development teams suffering within weekly scheduled deploys can achieve their desired results within a single iteration.

What Changed?

Unfortunately, the amazing change responsible for this vast improvement regarding Ruby on Rails column removal did not receive the attention is deserves. Many blog posts and guides were written about how to circumvent the older 3.2 behaviour, but when looking for information about this modern implementation, resources were scarce.

After reading through a very large amount of change logs and coming up empty, a bit of previous wisdom was utilized in order to track down the commit responsible.

The first step is to identify the offending code in 3.2. This way, the 4.0.0 version of the code can be analyzed, the discrepancy identified and the git revision responsible brought to light.

After some binding.pry insertions and utilization of the caller array, the offending code was traced to lib/active_record/attribute_methods.rb:244 (formatted for readability):

def arel_attributes_values(include_primary_key = true,
                           include_readonly_attributes = true,
                           attribute_names = @attributes.keys)
  attrs      = {}
  klass      = self.class
  arel_table = klass.arel_table

  attribute_names.each do |name|
    if (column = column_for_attribute(name)) &&
        (include_primary_key || !column.primary)

      if include_readonly_attributes ||
         !self.class.readonly_attributes.include?(name)
        value =
          if klass.serialized_attributes.include?(name)
            @attributes[name].serialized_value
          else
            # FIXME: we need @attributes to be used consistently.
            # If the values stored in @attributes were already type
            # casted, this code could be simplified
            read_attribute(name)
          end

        attrs[arel_table[name]] = value
      end
    end
  end

  attrs
end

While a little lengthy, the most obvious thing to note about this method is that its signature is comprised of entirely optional variables. Most notably, the final optional parameter, attribute_names, defaults to @attributes.keys which is the list of attributes for the given model.

In this case, the result of @attributes.keys is the complete list of keys for a User: ['id', 'first_name', 'last_name', 'email', 'updated_at', 'created_at']. Most notably, the email attribute is present, regardless of the fact that it was not given a value to insert. This means that by default, any create call will write to the email column without hesitation (just as previously observed).

While default behaviour can emit a code smell, it is only dangerous if it is used unknowingly. Since the parameter to determine attribute_names is available, chances are the caller of this method does indeed specify them, right?

Unfortunately, no:

def create
  attributes_values = arel_attributes_values(!id.nil?)
  new_id = self.class.unscoped.insert attributes_values

  self.id ||= new_id if self.class.primary_key

  IdentityMap.add(self) if IdentityMap.enabled?
  @new_record = false
  id
end

Here, in lib/active_record/persistence.rb:364 lies the method responsible for calling arel_attributes_values. Since it does not specify the list of attributes, the default behaviour prevails and all known attributes are returned.

The Big Fix

On a glorious Friday afternoon (probably) Jon Leighton made the commit which changed how people thought about column removal in Ruby on Rails.

The change enabled the concept of partial_writes. This pattern used the exiting changed hash present in every ActiveRecord model. The changed hash is manipulated as a model has attributes assigned to it.

In lib/active_record/attribute_methods/dirty.rb:77, partial_writes is used:

def create_record(*)
  partial_writes? ? super(keys_for_partial_write) : super
end

# Serialized attributes should always be written in case they've been
# changed in place.
def keys_for_partial_write
  changed | (attributes.keys & self.class.serialized_attributes.keys)
end

Before delegating to the same persistence.rb#create method, only the attributes present in the changed hash (along with any serialized attributes) will be used in subsequent insert queries.

Included in the February 2013 beta release of Ruby on Rails 4.0.0, this code enables a much more pleasant experience than the older work around. In fact, the author even pointed out this behaviour in his commit:

When inserting new records, only the fields which have been changed from the defaults will actually be included in the INSERT statement. The other fields will be populated by the database.

This is more efficient, and also means that it will be safe to remove database columns without getting subsequent errors in running app processes (so long as the code in those processes doesn’t contain any references to the removed column).

Unshackled

While this patch to the Ruby on Rails framework may not make every situation easier, it certainly reduces both cognitive and implementation overhead for many developers.

For more automated systems utilizing continuous deployment, two deployments may still be necessary since code running on old processes (within some sort of pre loader) will not be able to handle the column removal.

The deployment process for such a system may look more like:

1. Remove references to column
2. Deploy
3. Write migration to drop column
4. Deploy and migrate

Databases that power web applications are powerful and oddly fragile beasts. Manipulating these beasts while maintaining superb end user experiences requires one to constantly be learning. Hopefully with this behaviour, one more thing can be crossed off the check list of “things to remember when manipulating the database” and we can all just get back to work.