Sunday, May 10, 2009

Test Automation Side Effect

Lesson learnt: Run a test on environment X and collect reference data, then re-run the same test scripts and collect new data on a different environment Y. If you expect those results to be identical: good luck. They most likely are not. If you run a test and collect reference data, always run these tests on the exact same physical machine and don't compare data across different machines unless there is a strong reason to do so. There are always things that differ between these machines.

Here is the plan 

The Engineering team expected me to perform just 5 exemplary spot tests to verify that the responses to some web service requests were still fine after they upgraded a software component on ACCEPTANCE. I was told I would be informed by email when the ACCEPTANCE environment was ready for an initial test. Their plan was to mirror the data from PRODUCTION to ACCEPTANCE first.

Engineering would email me when this operation was completed, and I would start collecting the reference data by running my scripts. Once that was done, Engineering would upgrade the system and inform me, so I could re-run the scripts, collect the new data and compare it with the previously collected reference data. At least, that is what I thought the plan was.

Developing the scripts
I prepared my test scripts and extended them with additional checkpoints that I found useful and necessary. I ended up with over 300 test cases firing web service requests, far more than the team expected me to do. I tested my scripts on a development environment and collected and stored the results. Everything seemed to work fine. I was ready for the big show on ACCEPTANCE.
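Just to give an idea of what those scripts boil down to, here is a minimal sketch of the collection step. It is not the actual code I used; the endpoint URLs, the test case IDs, the output file name and the use of the Python requests library are assumptions for illustration only.

import json
import time
import requests  # assumed HTTP client, used here for illustration only

# Hypothetical test cases; the real suite contained over 300 of them.
TEST_CASES = [
    {"id": "TC001", "url": "https://acceptance.example.com/service/customer?id=42"},
    {"id": "TC002", "url": "https://acceptance.example.com/service/order?id=4711"},
]

def collect(outfile):
    # Fire each request once and store status, body and response time
    # as reference data for a later comparison run.
    results = {}
    for case in TEST_CASES:
        start = time.time()
        response = requests.get(case["url"], timeout=60)
        elapsed = time.time() - start
        results[case["id"]] = {
            "status": response.status_code,
            "body": response.text,
            "seconds": round(elapsed, 3),
        }
    with open(outfile, "w") as f:
        json.dump(results, f, indent=2)

if __name__ == "__main__":
    collect("reference_dev.json")  # in my case the baseline came from DEV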
I didn’t know much about the exact timeline. All I knew was that it would start around 1030h on a Sunday morning. I didn't know who was involved, and I didn't even know what exactly they had changed in the upgraded software. It was no functional change, just something minor at the database level.

Executing the test
On this sunny Sunday morning at 1030h (my kids running around in our garden, waiting for me), I was sitting in front of my laptop, waiting for the starting shot, that is, an email telling me that the production data had been successfully mirrored to ACCEPTANCE so I could run my scripts. I didn't get any emails. Finally, after I asked what was going on, I got a note from SYSADMIN stating that not only the data but also the new software had already been deployed to ACCEPTANCE and that QA could now start their tests....

Wait a minute… wasn't the idea to first collect the initial reference data so we would have something meaningful to compare against? Something had gone terribly wrong here.

Oops, we have an issue
However, I had no choice but to simply run my scripts against an already upgraded system, with no possibility of comparing the results against an untouched system. All I had was the data collected on the development test environment on Friday, while I was still developing and testing my scripts. When I compared the data I got worried: 20 tests failed, and some web service requests fired by my scripts showed a remarkable performance difference between the upgraded ACCEPTANCE environment and the old, untouched development environment. For some tests, the upgraded system responded 7 times slower than DEV.
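The comparison step itself is equally unspectacular. Here is a sketch of what it amounts to, again with made-up file names and an arbitrary slowdown threshold rather than my real scripts:

import json

def compare(reference_file, new_file, slowdown_factor=2.0):
    # Compare a new run against stored reference data: response bodies
    # must match, and response times should not blow past the reference
    # by more than slowdown_factor.
    with open(reference_file) as f:
        reference = json.load(f)
    with open(new_file) as f:
        new = json.load(f)
    for test_id, ref in reference.items():
        cur = new.get(test_id)
        if cur is None:
            print(f"{test_id}: MISSING in new run")
            continue
        if cur["body"] != ref["body"]:
            print(f"{test_id}: FAILED - response body differs from reference")
        if ref["seconds"] > 0 and cur["seconds"] > slowdown_factor * ref["seconds"]:
            print(f"{test_id}: SLOW - {cur['seconds']}s vs {ref['seconds']}s reference")

if __name__ == "__main__":
    compare("reference_dev.json", "results_acceptance.json")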

Investigation of failing tests
After some investigation I found the reason for 19 out of the 20 failed tests: they were false negatives. The reference data collected for these tests on DEV was an empty response. I had taken that as the reference, but running the same scripts on a production-near environment returned something different. This explained all the failing tests except one, for which the root cause was not yet identified. That was one out of 300, and of course we still had the performance problem with a few tests.
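That kind of false negative is easy to avoid once you know it exists: sanity-check the reference data itself before trusting it. Continuing the sketch above (again just an illustration, not my original code):

def validate_reference(reference):
    # Flag reference entries with an empty body. An empty baseline means
    # "anything that comes back later will look like a failure", which is
    # exactly how those 19 false negatives slipped into my comparison.
    suspicious = [tid for tid, ref in reference.items() if not ref["body"].strip()]
    for tid in suspicious:
        print(f"{tid}: reference body is empty - baseline is probably not trustworthy")
    return suspicious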

I repeated all my tests several times to gain more assurance that I was on the right track, but every test run produced the same results. During this analysis I suddenly got another note from SYSADMIN, this time telling me that the final production system had been upgraded as well….

WTF….! They upgraded production while I was still investigating what was wrong and before I could even provide them a report?

Consequence
After I raised my veto and presented my results (I finally learnt who my contact person was), some managers got nervous, and I mean really nervous. They wanted an exact explanation of the failing test and the performance issue, but we were unable to provide that information in the short timeframe we were given.
As a result, the managers decided to stop the exercise and instructed Engineering to roll back production immediately, which actually helped avoid a disaster when the customer used our system the very next morning. In my opinion this decision was wise, given that the communication and coordination of this operation had been a disaster.

5 test cases - are you serious?
In the end, what confused me is this: I was instructed to do only 5 tests. Are you serious?
I developed and executed 300, and ended up with one unexplained failing test. What would the result and the decision have been if I had executed only 5 tests? And what would the reaction of our customers have been?

The other vexing scenario is that after the PROD and ACCEPTANCE systems were rolled back, nobody asked me to test the rolled-back systems….

And what did I learn here?
OK, after spending so much energy summarizing this exercise, I should of course spend a moment looking at myself. How could I have prevented such a situation from occurring? It would probably have helped if I had started collecting the initial reference data well before the official test started. And even though I wasn't involved in any of the planning meetings, emails or calls, I could still have communicated my testing approach to the people I assumed might be involved. If I had taken the chance to present my approach, it might have helped to steer the test in a different direction.

And then it happened again
Many years later, by the way, I had a similar situation in a different company, where a developer asked me to test the performance of single REST requests fired against a system before and after an upgrade. The BEFORE was to be tested on environment X, the AFTER on environment Y. The developer stated that the machines were physically identical, but I didn't like that argument. As it turned out, they really were the same hardware: X and Y were two different web applications running on the same physical machine, accessing the same physical database, but running two different configurations and code bases.
X (the old system) was so terribly slow that each request took 40 seconds, while Y (the improved one) responded within 2 seconds. The developer cried, "see how the performance increased!" In fact, X was so badly configured that it was impossible to prove that the new version Y was faster because of his changes. When I demonstrated that there was yet another environment Z whose results were far better than those of X, even though Z had none of the optimizations, he realized that X was a broken and badly configured environment to compare against.