One reason testing is such a hard skill for developers to master is that the purpose of a test suite is rarely self-evident. Production code, meanwhile, rarely needs its motivation spelled out; after all, what is the purpose of an app that sells chocolate-dipped bananas but to… sell chocolate-dipped bananas? But the tests of such a system could serve myriad purposes: to prevent particular bugs from recurring, to provide an executable specification of its behavior, to promote simple code design, to enforce contracts with third-party APIs, or to rid oneself of the shame one feels when asked whether one’s code is tested.

As a result, even though testing is a secondary priority relative to shipping production code, its scope of responsibility is broader. If a test suite is to be at all comprehensible or maintainable, the team must define, communicate, and hold themselves to a discrete set of goals and constraints for their tests. This is why teams that prove perfectly competent at building applications often find themselves ill-equipped when it comes to testing.

Short of adopting an enlightened degree of thoughtfulness and an unyielding strategic discipline so as to invent a bespoke approach to software testing, what’s a team to do? One approach is to design each test suite with one or two narrow goals in mind, and then rigorously validate that it pays an appropriate return on investment (for more on that approach, check out this talk on test suite design). Another tactic is to hew to general heuristics known to promote good test design, as they offer a mechanism for low-friction, continuous course correction, keeping minor points of contention from escalating into drawn-out philosophical debates on the nature of testing.

One such testing heuristic—and one of my all-time favorites for promoting good test design—was put forward by the late, great Jim Weirich. Whenever Jim gave talks about testing practice, or about his tools like flexmock or rspec-given, he’d illustrate his approach to unit test design by holding every test to the standard that it be both necessary and sufficient.

What makes a test “necessary”? When isn’t a test “sufficient”? Why are these two attributes being juxtaposed and why would Jim have considered this such an important guiding principle of test design? Let’s begin by writing some bad tests that could be improved by applying this rule.

Necessary tests

Before saying much about a rule or how to follow it, it’s important to first consider the problem it attempts to solve. If someone tells you “all tests should be necessary”, you might ask: what does an unnecessary test look like? Well, consider this test:

require "minitest/autorun"

module BananaStand
  class BananaTest < Minitest::Test
    def test_banana_price_with_just_nuts
      subject = Banana.new

      subject.add_topping(:nuts)

      assert_equal 70, subject.price
    end

    def test_banana_price_with_nuts_and_sprinkles
      subject = Banana.new

      subject.add_topping(:nuts)
      subject.add_topping(:sprinkles)

      assert_equal 70, subject.price
    end

    def test_banana_price_with_chocolate_and_sprinkles
      subject = Banana.new

      subject.add_topping(:chocolate)
      subject.add_topping(:sprinkles)

      assert_equal 80, subject.price
    end

    def test_banana_price_with_3_toppings
      subject = Banana.new

      subject.add_topping(:chocolate)
      subject.add_topping(:sprinkles)
      subject.add_topping(:nuts)

      assert_equal 80, subject.price
    end
  end
end

At first glance, the Banana#price method being exercised by the test seems like it must be doing some kind of math to calculate the banana’s price depending on which of three toppings are added, right? The fact that there are numerous test cases that combine different toppings suggests to the reader that the toppings must depend on each other in some way.

Now, let’s look at the code:

module BananaStand
  class Banana
    def initialize
      @toppings = []
    end

    def add_topping(topping)
      @toppings << topping
    end

    def price
      if @toppings.include?(:chocolate)
        80
      else
        70
      end
    end
  end
end

Wait a second: :sprinkles and :nuts don’t appear in the code listing at all! In fact, the only topping that impacts the banana’s price is :chocolate, and even then, it’s a simple if-else branch with two cut-and-dried code paths.

Several critiques could be leveled against a test like this one. Most relevant to today’s discussion is that four test cases is two too many, because the code could be specified just as well with only two cases: one banana with chocolate and one without.

Others might defend the test as being extra thorough (despite the fact that coverage can’t exceed 100%), as anticipating future edge cases (which usually runs afoul of YAGNI), or as a realistic “black box” design (even though, as a unit test, it’s hopelessly coupled to the internal API it exercises).

Yes, a team could have these debates each time an example like this was encountered (this happens! I’ve been on those teams!), but the truth is few people have enough time to write good tests as it is, and repetitive disagreements don’t make life any easier.

As an alternative to arguing the merits, consider adopting the first half of Jim’s simple heuristic instead: only test that which is necessary to fully exercise the code under test.

What might a “necessary” thinning of our Banana#price test look like? How about:

module BananaStand
  class BananaTest < Minitest::Test
    def test_banana_price_with_chocolate
      subject = Banana.new

      subject.add_topping(:chocolate)

      assert_equal 80, subject.price
    end

    def test_banana_price_with_no_chocolate
      subject = Banana.new

      assert_equal 70, subject.price
    end
  end
end

This reduced-calorie test is now so minimal that it’s liable to make some people uncomfortable—but, at least for now, it covers everything that the measly method does. In fairness, the #price method may grow to be more complex in the future… so let that be the day to complicate the test to match.
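For instance, if each topping someday earned its own surcharge, only then would per-topping test cases become necessary. Here’s a hedged sketch of that hypothetical future (the surcharge amounts are invented for illustration):

module BananaStand
  class Banana
    # Hypothetical per-topping surcharges, invented for this sketch
    TOPPING_PRICES = {chocolate: 10, nuts: 5, sprinkles: 3}

    def initialize
      @toppings = []
    end

    def add_topping(topping)
      @toppings << topping
    end

    def price
      # A base price of 70, plus the surcharge of each topping added
      70 + @toppings.sum { |topping| TOPPING_PRICES.fetch(topping, 0) }
    end
  end
end

Only once each topping actually influences the price would a test case per topping pass the “necessary” bar.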

No rule is perfect, but this one can usually be assessed objectively. One quantifiable proxy for “necessary-ness” is redundant code coverage: analyze how many test cases execute each line and try to minimize the overlap. Establishing a critical eye for unnecessary tests may also change how you think about practices like generated tests, record-playback snapshot tests, and mocking libraries that disallow unexpected invocations.
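One way to get at that number: Ruby’s built-in Coverage module records per-line execution counts. Here’s a minimal sketch, assuming a hypothetical file layout and a test file that requires plain minitest (rather than minitest/autorun, whose at_exit hook would fire after coverage stops):

require "coverage"

Coverage.start # begin recording per-line execution counts

require "minitest"
require_relative "banana"      # the code under test (hypothetical path)
require_relative "banana_test" # its test cases (hypothetical path)
Minitest.run                   # run the suite now, not at_exit

# Coverage.result stops coverage and returns { file => [counts] }
Coverage.result.each do |file, counts|
  next unless file.end_with?("banana.rb")
  counts.each_with_index do |count, index|
    # nil marks a non-executable line; a high count on a trivial branch
    # suggests multiple test cases are covering it redundantly
    puts "#{file}:#{index + 1} executed #{count} times" if count.to_i > 1
  end
end

Run against the four-case suite above, the pricing branches report redundant hits that the two-case version mostly eliminates.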

Sufficient tests

So, if a test that does more than it needs to violates the “necessary” condition, you’d be right to presume that an insufficient test is one that fails to do enough.

“How much do I need to test this thing?” is a question that arises often when people are first getting a handle on unit testing, and it can be hard to get a straight answer. The most popular answer is probably “it depends”, but accepting that answer means forever submitting yourself to a world in which each-and-every unit test becomes a time-consuming series of nuanced judgment calls—and that’s no way to live.

Here’s a rule that’s easy to assess and hard to argue with: if the subject of a test were somehow deleted, the test would have been sufficient if any new implementation that passes it could be considered working and complete.

As an example, here’s a test that will momentarily prove to be insufficient:

require "minitest/autorun"

module BananaStand
  class RegisterTest < Minitest::Test
    def test_takes_a_dollar_no_dollars
      subject = Register.new(0)

      assert_equal 0, subject.take_a_dollar
    end

    def test_takes_a_dollar_fractional_dollars
      subject = Register.new(1.5)

      assert_equal 1, subject.take_a_dollar
      assert_equal 0, subject.take_a_dollar
      assert_equal 0.5, subject.dollars
    end

    def test_takes_a_dollar_multiple_dollars
      subject = Register.new(2)

      assert_equal 1, subject.take_a_dollar
      assert_equal 1, subject.take_a_dollar
      assert_equal 0, subject.take_a_dollar
      assert_equal 0, subject.dollars
    end
  end
end

Clearly, the Register is instantiated with some number of dollars in the till and the #take_a_dollar method will, if any dollars are left, return a single dollar and decrement the count.

Now, here is the production code this test exercises:

module BananaStand
  class Register
    attr_reader :dollars

    def initialize(dollars)
      @dollars = dollars
    end

    def take_a_dollar
      if @dollars >= 1
        Inventory.instance.throw_out_a_banana
        @dollars -= 1
        1
      else
        0
      end
    end
  end
end

Upon reading this, everything tracks with our expectations from having read the test, except for this line:

Inventory.instance.throw_out_a_banana

What’s that line doing there? Why doesn’t the test know about it? Does it mean this line doesn’t really matter? Perhaps this behavior was deemed too difficult to test, and so it wasn’t tested at all?

Regardless of the answers to these questions, they illustrate why code coverage will not save you: even though this example has 100% code coverage, its behavior is not fully tested. This is what people mean when they say code coverage is a “one-way metric”: it can tell you which code is definitely untested, but it can’t tell you whether the code it covers is tested well.

If we deleted the Register class and re-implemented it using only the above test as a guide, we would likely arrive at a virtually identical code listing, with every behavior accounted for except the call to throw_out_a_banana. For the sake of this discussion, let’s assume (silly as it sounds) that throwing out a banana is an important and intentional behavior of this system. That being the case, we can conclude the test was not “sufficient”.

One way I like to get a sense of whether a bit of code is fully tested is to delete standalone lines at random as I read through the source, re-running its tests after each deletion to ensure it triggers a failure. (This can be a fun, if cruel, group activity in front of a projector. If the tests continue to pass after a line is deleted, you might jokingly declare, “well, I guess that line wasn’t necessary!”)
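In that spirit, here’s a rough sketch of automating the game: a poor man’s mutation test that deletes one line at a time and reports any deletion the test suite fails to notice. The file paths are assumptions for illustration, and tools like the mutant gem perform this kind of analysis far more rigorously:

source_path = "register.rb" # hypothetical path to the code under test
original = File.read(source_path)
lines = original.lines

lines.each_with_index do |line, index|
  next if line.strip.empty? || line.strip.start_with?("#")

  # Rewrite the source with this single line deleted
  File.write(source_path, (lines[0...index] + lines[index + 1..-1]).join)

  # If the suite still passes, no test was sufficient to notice the change
  passed = system("ruby register_test.rb", out: File::NULL, err: File::NULL)
  puts "No test noticed deleting line #{index + 1}: #{line.strip}" if passed
end

File.write(source_path, original) # restore the untouched source when finished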

Back to our example, how might we add an assertion for this line? We might use a test double library like gimme and add test cases for each code path:

def test_takes_a_dollar_throws_out_a_banana
  inventory = gimme(Inventory)
  give(Inventory).instance { inventory }
  subject = Register.new(1)

  result = subject.take_a_dollar

  assert_equal 1, result
  assert_equal 0, subject.dollars
  verify(inventory).throw_out_a_banana
end

def test_dont_throw_out_bananas_when_no_dollars
  inventory = gimme(Inventory)
  give(Inventory).instance { inventory }
  subject = Register.new(0)

  result = subject.take_a_dollar

  assert_equal 0, result
  assert_equal 0, subject.dollars
  verify(inventory, 0.times).throw_out_a_banana
end

Well, okay. Yuck. The subject is now fully tested, but I’m hardly happy about it. I like to say that hard-to-test code is usually hard-to-use code, and this test highlights four major design problems with this (six-line!) method:

  • The name take_a_dollar is a lie, since the method also affects our inventory of bananas. If it had been named honestly (i.e. take_a_dollar_and_throw_out_a_banana) it would have been clear the method was in violation of the single responsibility principle
  • By both returning a value and triggering a side effect, the method violates command-query separation
  • By referencing a singleton instance—as opposed to having an Inventory passed in or instantiating one itself—the state of each Register instance is coupled to the global Inventory class, which could easily lead to test pollution or hard-to-debug production errors
  • By parceling part of its work off to an Inventory dependency and doing part of its work itself by operating on a primitive counter, the subject is mixing levels of abstraction

Because it’s easy to sneak a one-liner into a method, any of these design problems could easily have been missed. Taking the time to write these tests shined a light on the complexity that had been masked by the terseness of the throw_out_a_banana call.

This example demonstrates why strictly adhering to the “sufficient” rule, by refusing to leave important behavior untested, can be so valuable. If we’re ever tempted to skip testing some behavior, it’s probably because that behavior is hard to test, and it’s probably hard to test because of underlying design problems in the code. And there will never be a better time to remediate a problematic design than before shipping it.
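To make that concrete, here is one hedged sketch of a remediation (the Sale coordinator and its name are my invention for illustration, not a prescribed fix): the register handles only money, while a separate object receives the Inventory as a dependency and owns up to both effects in its name:

module BananaStand
  class Register
    attr_reader :dollars

    def initialize(dollars)
      @dollars = dollars
    end

    # Now does exactly what its name says: takes a dollar, nothing more
    def take_a_dollar
      return 0 if @dollars < 1
      @dollars -= 1
      1
    end
  end

  # Hypothetical coordinator: its Inventory is injected rather than
  # reached for globally, and its name is honest about both effects
  class Sale
    def initialize(register, inventory)
      @register = register
      @inventory = inventory
    end

    def take_a_dollar_and_throw_out_a_banana
      taken = @register.take_a_dollar
      @inventory.throw_out_a_banana if taken == 1
      taken
    end
  end
end

Testing each piece separately becomes trivial: the Register test needs no doubles at all, and the Sale test can verify the interaction with an injected fake instead of stubbing a singleton.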

Conclusion

In sum, by ensuring that each test meets these two criteria, we’ll end up avoiding entire categories of test smells:

  • Necessary: test only the situations required to fully exercise all of the code’s behavior
  • Sufficient: assert everything that a new implementation of the code would need to do to be considered complete

Humbling as it is to admit, it has taken me years of practice to internalize the healthy tension between these two rules, but now that I have, I’m able to work much more effectively in new and existing codebases alike. I’m grateful to Jim for having created this and so many other heuristics for how to write code. More than that, he normalized the idea that any one of us can create our own guidelines to help improve our software’s design and communicate nuanced topics among our teams. Feeling empowered to distill hard-fought lessons into our own simple rules of design has doubtless played a large role in making Test Double so successful at helping such a wide variety of our clients’ teams improve.

Justin Searls
