Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add encryption to Active Record #41659

Merged
merged 67 commits into from Apr 1, 2021
Merged

Conversation

jorgemanrubia
Copy link
Contributor

@jorgemanrubia jorgemanrubia commented Mar 11, 2021

This adds encrypted attributes to Active Record models. This is an extraction from HEY.

You can check the guide included with this PR to learn more about functionality and features.

Basic usage

Encrypted attributes are declared at the model level. These are regular active record attributes backed by a column with the same name. The library will transparently encrypt these attributes before saving them into the database and will decrypt them when retrieving their values.

class Person < ApplicationRecord
  encrypts :name
end

By default, it will encrypt all the data using AES-GCM with a 256-bits key and a non-deterministic approach: encrypting the same text twice will result in different ciphertexts.

You can configure it to use a deterministic encryption approach instead. In this case, the initialization vector will be derived from the text to encrypt instead of being random, which enables queries against encrypted columns:

class Person < ApplicationRecord
  encrypts :name
  encrypts :email_address, deterministic: true
end

# Person.find_by(name: "jorge") # doesn't work
# Person.find_by(email_address: "jorge@basecamp.com") # works

The library includes many additional features.

  • Encrypt action text attributes
  • Support ignoring case
  • Work with encrypted and unencrypted data simultaneously.
  • Configurable encryption schemes
  • Support old encryption schemes while migrating existing encrypted data to a new scheme
  • Compression
  • Fixtures support
  • Filter parameters based on encrypted attributes (to remove sensitive params from logs).

Please check the guide in this PR to learn more about setup and usage.

Implementation overview

EncryptableRecord is the concern that makes ActiveRecord::Base records encryptable. It defines .encrypts and the record-level API.

When declaring an attribute as encrypted, internally, it uses a custom active record attribute type (EncryptedAttributeType). This performs the encryption and decryption when serializing and deserializing attribute values. It interacts with the encryption system to do this. The main components of the encryption system are:

  • Encryptor: the internal API for encrypting and decrypting data. It interacts with a KeyProvider to build encrypted messages and deal with their serialization. The encryption/decryption itself is done by the Cipher and the serialization by MessageSerializer.
  • Cipher the encryption algorithm (Aes 256 GCM)
  • KeyProvider serves encryption and decryption keys.
  • MessageSerializer: in charge of serializing and deserializing encrypted payloads (Message).

While most users won't need to do this, these components can be customized via config settings like:

config.active_record.encryption.encryptor = MyEncryptor.new

They can also be overridden just for a given block or on a per-attribute basis:

ActiveRecord::Encryption.with_encryption_context(encryptor: MyEncryptor.new) do
  ...
end

class Article
  encrypts :title, encryptor: MyEncryptor.new
end

Queries with deterministic attributes

When querying data encrypted deterministically, a tricky problem arises when you combine encrypted and unencrypted data. Or data encrypted with different schemes (e.g: because you decide to change encryption properties for a given attribute).

To solve this, we extend Active Record to modify queries automatically. So, for example, this query:

Contact.find_by(email_address: "jorge@hey.com")

Becomes this under the hood when you enable support for unencrypted data:

Contact.find_by(email_address: [ "jorge@hey.com", "<encrypted jorge@hey.com>" ])

The implementation of this system is in ActiveRecord::Encryption::ExtendedDeterministicQueries. This system has worked flawlessly for HEY since we launched, but the implementation feels a bit brittle and experimental. Because of that, I’ve put it behind a config flag encryption.extend_queries that is false by default.

I’d be happy to get advice (and help!) to improve this implementation.

Message structure

Each encrypted value is a Message that stores:

  • The encrypted payload
  • A public (unencrypted) list of headers (Properties)

Example of headers:

  • Encryption-related information such as initialization vectors.
  • Specific encryption system meta information, such as if the payload is compressed or not.
  • Key-related information such as encrypted keys (for envelope encryption) or key references.

Serialization

Messages are serialized with a JSON-based serializer (MessageSerializer). The serializer is configurable via active_record.encryption.message_serializer.

It's important to use safe serializers that can't deserialize arbitrary objects. A common supported scenario is encrypting existing unencrypted data. An attacker can leverage this to enter a tampered payload before encryption takes place and perform RCE attacks. This means custom serializers should avoid Marshal, YAML.load (use YAML.safe_load instead) or JSON.load (use JSON.parse instead).

Performance

Time

OpenSSL AES encryption/decryption is fast, with submillisecond times for typical payloads (<700kb or less) or just a few milliseconds for multi-MB payloads.

As a result, and in our experience, the impact of adding this encryption system to an app is negligible in terms of performance. Even fast database queries are slower by one order of magnitude or more, so the overhead of encryption gets diluted when combined (not to mention if you add view-rendering to the mix).

This PR include some automated performance tests where a baseline without encryption gets compared with similar code using encryption for different key providers. You will see how the impact on tests is around 30% when encrypting data, but this is because there is no database latency in tests, so the relative impact is magnified. The same test shows identical performance when running in a more realistic environment. These tests are useful for optimization work and to prevent performance regressions.

Storage

Encryption adds storage overhead because:

  • Values are encoded in Base64. This enables storing and transmitting binary data from ciphertexts and encryption metadata without encoding issues.
  • It stores encryption metadata along with the encrypted payload. For example, initialization vectors, auth tags (AES integrity), other keys when using envelope encryption, etc.

The worst-case overload we have measured is around 250 bytes when the built-in envelope encryption key provider is used. For medium and large text columns this overload is negligible, but for string columns of 255 bytes, you should increase their limit accordingly (510 is recommended).

To help to mitigate the encryption overload, the library uses compression automatically when the payload exceeds a certain threshold (140 bytes). Compression is done before encrypting the content (compressing after wouldn't compress much due to randomness). We have observed an actual reduction in space for larger text columns.

Using compression opens a vulnerability where an attacker can enter highly compressible payloads that can cause trouble when uncompressed. To prevent this, the library adds a length validation to the encrypted column based on the database column limit. This can be disabled with the option active_record.encryption.config.validate_column_size.

There is a test StoragePerformanceTest measuring the storage overload in different scenarios. Notice that compression is not really showing much difference there due to using randomly-generated payloads. The test is meant to catch regressions and to help with storage optimization work.

It would be easy to make compression a config option, as a future improvement. Without compression you can expect a consistent 33% overhead due to the Base64 encoding.

Security audit

A security firm reviewed this library before launching HEY. Thanks to it, we could fix an important flaw regarding our deterministic encryption approach in the initial version.

Credits

  • Many folks at Basecamp helped to shape this with their feedback. Special mention to @jeremy who contributed some key design decisions while we were trying to figure this out, and to @rosa and @georgeclaghorn for their help reviewing things and discussing approaches.
  • I checked many similar libraries while working on this one, and I learned good things from all of them. I want to make a special mention to @reidmorrison and symmetric-encryption for showing how encryption was a great scenario for using a custom attribute. Other gems I checked and was inspired by include attr_encrypted, lockbox and vault-rails.

Pending

  • Add configuration options to guide
  • Make the build happy
  • Move performance tests out of the main build (to a rake task)
  • Address feedback by Rafael
  • Test upstreamed version in HEY after all the changes

Copy link
Member

@rafaelfranca rafaelfranca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this.

The general architecture of the feature looks correct but there are some major points that I think we need to fix before releasing this.

  • The first is the coupling to Action Text that is now inside Active Record. I know the dependency is not needed but I don't think Active Record should have any code that know about Action Text inside it. We can inject Action Text specific behavior in the same way we did with has_rich_text methods.
  • The install of query extensions is loading ActiveRecord::Base too early in the boot process. That means that configs set inside config/initializers will not be applied anymore. The fix is easy, we just need to move those extensions to inside on_load blocks.

actiontext/lib/action_text/attribute.rb Show resolved Hide resolved
activerecord/lib/active_record/encryption.rb Outdated Show resolved Hide resolved
activerecord/lib/active_record/encryption.rb Outdated Show resolved Hide resolved
activerecord/lib/active_record/encryption.rb Outdated Show resolved Hide resolved
activerecord/lib/active_record/encryption/cipher.rb Outdated Show resolved Hide resolved
activerecord/lib/active_record/encryption/encryptor.rb Outdated Show resolved Hide resolved
# @TODO It should not patch anything if not needed (no previous schemes or no support for previous encryption schemes)
module ActiveRecord
module Encryption
module ExtendedDeterministicQueries
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What would make us comfortable to move the implementation here to the right places and not put it behind a flog?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rafaelfranca I'd like to see other apps using the extended queries and see if it works well for them. The system has worked well for us, without issues, but I wouldn't be surprised if there are queries that it's not covering, for example. Said this, we can certainly invert the default: set the flag to true and let the apps disable it if they want. Maybe that's a better way to get reports during the beta phase.

Although not as part of this PR, I'd like to explore a more robust approach. Now that encryption is part of Active Record, there is no need to patch existing methods I think. I suspect it should be possible to come with a better and maybe simpler implementation of the same idea.

activerecord/lib/active_record/railtie.rb Outdated Show resolved Hide resolved
activerecord/lib/active_record/railtie.rb Outdated Show resolved Hide resolved
activerecord/lib/active_record/railtie.rb Outdated Show resolved Hide resolved
@saghaulor-coupa
Copy link

saghaulor-coupa commented Mar 11, 2021

I'm excited to see this sort of thing become built into the framework. Having maintained attr_encrypted, and developing a few different implementations of crypto at Coupa, I have learned a few things that make life more difficult or more easy. I would like to offer something that we learned during our journey. I hope that you find it useful.

If I've understood this implementation correctly based on https://github.com/basecamp/rails/blob/active-record-encryption/guides/source/active_record_encryption.md#encryption-contexts, the encryption context is not available at the record level, which is to say, it is not stored with the encrypted payload much like the DEK is. There is utility in storing the encryption context with the payload. Namely, assume 10-100s of millions of records are encrypted with a certain encryption context. Then a regulation change or your chosen algorithm is broken and requires some new encryption context. Having the context with the payload means you can simply change to the new default so new records encrypt with the new context and existing records will still be able decrypt with the old context. If a re-encryption operation is necessary, it can be done in the background while the app is live, rather than requiring the app to go offline to re-encrypt with the new context, or having some logic that must make assumptions about the state of the data to facilitate live re-encryption. Having the encryption context live with the encrypted payload makes decryption and re-encryption operations painless.

@jorgemanrubia
Copy link
Contributor Author

jorgemanrubia commented Mar 11, 2021

@sagarjunnarkar

Having the context with the payload means you can simply change to the new default so new records encrypt with the new context and existing records will still be able decrypt with the old context

We support this scenario. Actually, I have had to do this several times in HEY already. It works like this:

Imagine you have a class with an encrypted attribute:

class Person
  encrypts :email, deterministic: true
end

Now, imagine that you need to encrypt it with AES-512 (even if it's a bit futuristic) and you have millions of records encrypted with AES-256. If you made the change like this:

class Person
  encrypts :email, deterministic: true, cipher: Aes512.new 
end

You would have a problem as you say. Existing encrypted data would fail to decrypt. But you can do this:

class Person
  encrypts :email, deterministic: true, cipher: Aes512.new, previous: [ { cipher: Aes256.new } ]
end

Now, the system:

  • Will use Aes512 for encrypting
  • Will try to use Aes512 and Aes256 when decrypting, whatever works

So just running this will re-encrypt the record with the new scheme:

person.encrypt 

As a bonus, if you tried to search a Person:

Person.find_by_email("jorge@hey.com")

It would execute this query under the hood:

Person.find_by(email: [ "<email encrypted with AES512>", "<email encrypted with AES256>" ])

This means that queries keep working but data encrypted with both encryption schemes coexist.

As mentioned, we've had to exercise this more than once during HEY lifetime. Now, the last bit that is missing is a mechanism to declare these previous encryptions schemes in a global way. So far in HEY I've done some monkey patching like this to add that :previous option globally:

module PreviousEncryptionScheme
  module EncryptableRecord
    def encrypts(*names, key_provider: nil, key: nil, deterministic: false, downcase: false, ignore_case: false, previous: [], **context_properties)
      previous = Array.wrap(previous)
      previous << ({ key_provider: key_provider, key: key, deterministic: deterministic, downcase: downcase, ignore_case: ignore_case,
        encryptor: OldEncryptor.new, message_serializer: OldMessageSerializer.new, cipher: OldCipher.new
      })
      super
    end
  end
  ...

But the idea I've always had in mind was adding something like this option instead:

active_record.encryption.previous = [ { ... } ] # global previous encryption contexts

I was actually close to implement this as part of this PR, it's an easy one, but I kept postponing it and decided to leave for a later PR 😅.

I appreciate the feedback from someone with experience maintaining a similar library. Please keep it coming if you have more thoughts.

@saghaulor-coupa
Copy link

I suppose that's a good compromise. We operated in a fashion where we wouldn't re-encrypt unless necessary (keys were compromised), so if the encryption context (and in our scenario that included DEK ID) changed, we'd just leave things in place. Consequently, across millions of records you could have several encryption contexts, but because it lives with the encrypted payload, everything just works. Your implementation suggests that you can only ever have two encryption contexts. Of course, your implementation is a little more storage efficient, whereas my suggestion requires more storage space but provides more flexibility in changing the encryption context.

@jorgemanrubia
Copy link
Contributor Author

@saghaulor-coupa

Your implementation suggests that you can only ever have two encryption contexts

You can have as many contexts as you want, it's a list of previous contexts. Now, it's true that it's not meant to have a long list of previous contexts, as that might impact decryption performance (it might have to try them all to find the right one). In that case the design you suggests would be much better.

As you say, it's all about tradeoffs. Thanks for sharing your experience here.

jorgemanrubia added a commit to basecamp/rails that referenced this pull request Mar 11, 2021
This adds new rake tasks for these tests. This way, it prevents these
tests from making the build fail.

rails#41659 (comment)
@inopinatus
Copy link
Contributor

Looks great. Just one thought: using deterministic encryption may be unnecessary for the ignore_case mechanism. I've seen implementations achieve similar outcome (particularly for email addresses) by recording the SHA-256 HMAC over a derived subkey & the case-folded attribute value via an ancillary column (e.g. named digest_<name>), that's then used only for indexing and lookup. Was this method considered in the cryptographic assessment?

@jorgemanrubia jorgemanrubia force-pushed the active-record-encryption branch 6 times, most recently from f13812a to 502ffde Compare March 12, 2021 13:33
@jorgemanrubia
Copy link
Contributor Author

@inopinatus That could certainly be possible, but it would require to introduce this special case-handling in several spots, including the extended search mechanism we have in place. I think I prefer the current approach where ignore_case: leverages on the existing deterministic: implementation.

@abhaynikam
Copy link
Contributor

@jorgemanrubia Great addition ❤️ . I think the changelog entry is missing in the PR 😊

@jorgemanrubia
Copy link
Contributor Author

Indeed! Thanks @abhaynikam, adding it here #41821

@geetfun
Copy link

geetfun commented Apr 2, 2021

Thanks for building this into the Rails stack!

@wwahammy
Copy link

@jorgemanrubia this looks like an amazing feature! I'm excited to look at using it.

I noticed that the guide and test uses the term "master key" in multiple places, most notably the guide and the tests. If I'm understanding the code correctly, I think this is actually referring to a "primary key". I assume this was changed before submitting to Rails from your internal implementation so it's understandable that some things can fall through the crack in the docs. Am I accurate in understanding that "primary key" is actually what is being referred to in places where you use "master key"?

@sljmn
Copy link

sljmn commented Apr 29, 2021

It is indeed the same as primary key
28145c3

I noticed that the guide and test uses the term "master key" in multiple places, most notably the guide and the tests. If I'm understanding the code correctly, I think this is actually referring to a "primary key". I assume this was changed before submitting to Rails from your internal implementation so it's understandable that some things can fall through the crack in the docs. Am I accurate in understanding that "primary key" is actually what is being referred to in places where you use "master key"?

@jorgemanrubia
Copy link
Contributor Author

jorgemanrubia commented Apr 29, 2021

@wwahammy You are right, it should be using "primary" everywhere, it seems I missed some when I did the change. I'll create a PR with renaming the last instances. Thanks for catching that!

UPDATE: PR here.

@brandoncc
Copy link
Contributor

This is amazing, thank you to everyone involved!

@butsjoh
Copy link

butsjoh commented Jun 16, 2021

Hi,

Just wondering if the encryption relies activerecord callbacks only. What happens if you would upsert (https://edgeapi.rubyonrails.org/classes/ActiveRecord/Persistence/ClassMethods.html#method-i-upsert) data? Since upsert does not instantiates a record i guess encryption will not be possible?

Just want a clarification, not asking to actually do this :)

casperisfine pushed a commit to Shopify/rails that referenced this pull request Aug 1, 2022
Ref: rails#41659

Mutating the attributes hash requires to workaround the mutation
in `find_or_create_by` etc.

One extra Hash dup is not big deal.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet