Tuesday, August 15, 2023

Mutation Testing and why we don't need it, or do we?

When our kids were still small, it was a tradition every Easter to hide chocolate eggs, sweets and small presents in the garden, around the house, by the carport and sometimes even inside the house.
While the kids were excited to find all the little things, we parents watched them with equal excitement.

Long after Easter was over, the odd egg would still turn up by accident in a corner or in a plant pot, too old to be edible. In other words, our kids didn't track down all of them at Easter. Over the years, they got better and better. We had to be more creative in finding new, extraordinary places to hide the little things from them, so they didn't have an easy catch (no "low-hanging fruit", as testers would say).

While we never gave a thought to our kids' "mathematical" effectiveness at finding all these little presents, this is exactly what mutation testing is about.

It is a method to measure how effective unit tests are at detecting anomalies in the code. The idea is to inject bugs on purpose and then verify how many of them are found. That's pretty much the same as hiding chocolate eggs in the garden.

Typical examples of injected bugs (mutants) are a comparison operator changed from something like (x<y) to (x>y), or a boolean value flipped from true to false or vice versa. In the case of a calculation engine, the computed value could be fuzzed so that it returns an incorrect result. The point is that these bugs are introduced on purpose and - in contrast to our annual tradition at Easter - the tool that modifies the code knows exactly how many mutants were added and where.
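To make this a bit more concrete, here is a minimal, hypothetical sketch of what such a tool does. The class and method names are invented, and the mutant shown is a boundary change of a comparison, a close cousin of the operator flip described above:

    // Original production code (hypothetical example).
    public class Discount {
        // An order qualifies for a discount once it reaches the threshold.
        public boolean qualifies(int amount, int threshold) {
            return amount >= threshold;
        }
    }

    // The same method after the tool has injected exactly one mutant:
    // the boundary of the comparison has been changed from ">=" to ">".
    //
    //     public boolean qualifies(int amount, int threshold) {
    //         return amount > threshold;
    //     }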

When the unit tests are executed, the mutation testing tool compares the number of failing tests with and without the modified code. If the number of failing tests is the same in both scenarios, this is an indication of inadequate tests.
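Sticking with the hypothetical Discount class above, and assuming JUnit 5, this is roughly what "the same number of failing tests" means in practice: with only the first test, nothing fails with or without the mutant, so the mutant survives; the second test kills it.

    import static org.junit.jupiter.api.Assertions.assertTrue;
    import org.junit.jupiter.api.Test;

    class DiscountTest {

        // Passes against the original AND the mutant (100 >= 50 and
        // 100 > 50 are both true). With only this test, zero tests fail
        // in both scenarios - the mutant survives undetected.
        @Test
        void clearlyAboveThresholdQualifies() {
            assertTrue(new Discount().qualifies(100, 50));
        }

        // Exercises the boundary: passes on the original (50 >= 50) but
        // fails on the mutant (50 > 50), so the number of failing tests
        // differs and the mutant is killed.
        @Test
        void exactlyAtThresholdQualifies() {
            assertTrue(new Discount().qualifies(50, 50));
        }
    }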

I am not experienced in automated mutation testing, but I find the topic quite interesting, especially because IT companies tend to measure just test coverage but often have no idea whether their unit tests are really effective. Test coverage alone doesn't tell you anything about the quality of your tests. You can have 100% test coverage for a method and still fail miserably with uncaught exceptions when other valid inputs are passed to that method.
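As a small illustration (again JUnit 5 and an invented method; a very similar divide example comes up again in the comments below): a single test is enough for 100% line coverage, yet a perfectly valid input remains completely unchecked.

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    class CoverageIllusionTest {

        // Trivial method under test (hypothetical).
        static int divide(int a, int b) {
            return a / b;
        }

        // This single test executes every line of divide(), so a coverage
        // tool happily reports 100% for the method...
        @Test
        void tenDividedByFiveIsTwo() {
            assertEquals(2, divide(10, 5));
        }

        // ...yet divide(10, 0) - another valid call as far as the compiler
        // is concerned - throws an ArithmeticException that no test would
        // ever have revealed. Coverage measures execution, not checking.
    }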

Although mutation testing is usually done as part of automated testing using corresponding plug-ins, you can also do mutation testing manually. When I drew the cartoon, I was more focused on the manual aspect and less on the potential of using it to test existing automated tests.

Let's take a few steps back and look at our current approach. We have a lot of manual tests (>1000), a lot of unit tests (>20'000), a very effective API test suite (>3000 tests) and a few UI tests (ca. 100), following the typical test automation pyramid in terms of distribution, but we haven't integrated any sort of mutation testing yet.

I automatically get emailed whenever our testers find defects, whether through manual testing, the automated UI test scripts or the automated API tests. Based on the number of emails received daily, I conclude that we are an effective test team finding many defects. But, of course, it would be more interesting to learn whether we could do even better. Are there even more bugs around to catch? Honestly, with the current amount of anomalies reported by my testers, my first reaction was rather defensive. Why should I inject any additional bugs on purpose? We already have enough to do analyzing all the findings that slipped into the code unintentionally. This was also the original idea behind the cartoon, but... here is my mistake:

We have no facts at hand, just a certain number of defects we raise every week.

Mutation testing could help us collect more facts. It can not only be applied through tools, it can also be done manually. For example, if you want to understand how long it takes to find a certain (obvious) bug introduced on purpose, just add it and see. You don't even need to inject code; you can also change a configuration so that it leads to different (unexpected) behavior.

For example, one of the software products I test creates documents with inquiries to doctors. A configuration setting allows the documents to be fitted with a data-matrix code on the pages the doctors have to fill out and return. When the letters come back with the data-matrix code on them, a software component can automatically identify the original request and the related patient, and map them to the answer received. This enables quick access to both the original request letter and the response.

The configuration could be turned off (on purpose), causing the created letters to be sent out without a data-matrix code. How long do you think it will take until our testers notice the missing data-matrix code on the letters?

I am pretty sure it won't take long, because such a test is well documented in the regression test suite. But what if we challenge them more - say, by making the letters print a hard-coded data-matrix code that is the same for all letters?

It takes more effort for a tester to find that problem.

If the test is not documented, the testers are likely to miss the bug. If it is documented, whether the test is executed at all may still depend on the priority set for the test case. And if testers are overconfident that this piece is unlikely to fail, they won't test it either.
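To tie the data-matrix scenario back to code: the sketch below is purely hypothetical (the class, the property name and the rendering logic are all invented and not taken from the actual product), but it shows how flipping a single configuration value in the test environment injects a mutant without touching any code.

    import java.util.Properties;

    public class LetterRenderer {
        // Hypothetical flag; in a real product the setting would come from
        // the application's own configuration mechanism.
        private final boolean addDataMatrixCode;

        public LetterRenderer(Properties config) {
            this.addDataMatrixCode = Boolean.parseBoolean(
                    config.getProperty("letters.addDataMatrixCode", "true"));
        }

        // Renders a letter; the request id is only encoded when the flag is on.
        public String render(String body, String requestId) {
            if (addDataMatrixCode) {
                return body + "\n[DATA-MATRIX:" + requestId + "]";
            }
            // Manual mutant: with the flag switched off in the test
            // environment, letters silently go out without the code.
            return body;
        }
    }

How long it takes for someone to notice is then exactly the kind of fact we were missing.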

If you inject such mutants, you need to be clear on your goal. Do you want to test the efficiency of the testers, the accuracy of the test cases or the effectiveness of automated tests?

3 comments:

  1. Humorous, but doesn't really seem to address the question. If not MT, how *do* you ensure that, or at least check whether, your test suite is any good? Coverage alone is nowhere near sufficient, as can be demonstrated by simply removing all the assertions and observing that your coverage does not decrease. If you want to trade funny pictures on the topic, I'll take my turn with this one I made: https://twitter.com/DaveAronson/status/1690353658167267328/photo/1

  2. How do we check whether our test suite is any good? Any bug detected using our test suite is already an indication that it is good. We track all bugs detected by the suite with a JIRA label and can actually run reports on it. Coming back to coverage: that's another one I like. What does coverage really mean?
    100% code coverage can be achieved with a single test case already, e.g. a test case that passes 10 and 5 to a method that divides two integers. But what if in real life someone passes 0 as the second number? I have my "opinion" about these coverage metrics, to be honest.
    In the end, what interests me most is how many release-critical bugs our customers find compared to us, and how we can prevent such anomalies from reaching the field in the future.

  3. Detected bugs are good, but how do you know how many bugs are slipping through? You won't have a JIRA ticket for the ones that haven't been found yet, let alone the ones that never existed in your codebase. Mutation testing won't produce all possible bugs of course, but it should produce a good sampling of the possible simple ones. Not simple as in simple to avoid, else we obviously would, but as opposed to complex. According to the Competent Programmer Hypothesis, most bugs are from single small simple things, like operator reversal, limits off by one, typos (especially in languages where you don't have to declare variables), and so on. MT is a way to check whether you had any of those happen, before the customer does, or even your own QA team.

    We are in agreement re coverage, though!

    And nice to see someone else who remembers DCL and REXX, though I used it more under OS/2 than on mainframes. :-)
