I drew this cartoon a week ago, not knowing that it would fit a situation I experienced only a few days later even better, when we performed a migration test of a software component. I was instructed to perform just about 5 exemplary spot tests, and I was told that I would still be informed via email about the purpose and coordination of the tests planned for the weekend.
Actually, I never got any emails, so I just prepared my scripts and extended them with additional checking code last Friday. My automated scripts consisted of 300 test cases, and I checked them on an integration test environment before they were ready for the big show. All I knew about this test was that it would start at 10:30h on a Sunday morning. I didn't know who was involved, and I didn't know anything about what was going to change in the software component. On this sunny Sunday morning at 10:30h (my kids running around in our garden and waiting for me), I was sitting in front of my laptop, waiting for the starting shot, that is, an email telling me that the productive master data had been mirrored to the Acceptance Test Environment.
This is the normal procedure when we test a patch or a new release. It is the signal for the testers to start creating snapshots of the system "before" (some automatically, some manually). But I didn't get any emails on that Sunday morning, just as I hadn't gotten any of those announced emails on Friday.
I decided to execute my scripts anyway and start collecting the initial state of the system. In parallel, I informed my boss that I still hadn't been involved in any discussions of these tests, nor did I know who the coordinator of those tests would be. Only minutes later I got an email from the sysadmin, stating that the migration on the Acceptance Environment was completed and that the testers could now start their tests…
…Wait a minute, wasn't I supposed to collect the initial state first? Well, then let's take the snapshot of the system "AFTER" instead. The results on the Acceptance Environment after migration looked surprisingly good. They didn't differ from earlier tests that I had executed on Friday night while extending my scripts and testing them on two different integration environments. That made me believe the test on the Acceptance Environment must have been successful, although I hadn't really done the "BEFORE/AFTER" comparison on the same physical system. The chance of the migration test failing on the final system was minimal, I thought. That test would still be necessary, though, that's for sure.
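For what it's worth, the before/after idea itself is simple. Here is a minimal sketch in Python with hypothetical names (my real scripts look different): persist the responses of all test cases as a labelled snapshot, then diff two snapshots by test-case id.

```python
import json
from pathlib import Path


def take_snapshot(responses: dict, label: str, out_dir: str = "snapshots") -> Path:
    """Save the raw response of every test case under a label ("before"/"after")."""
    path = Path(out_dir) / f"{label}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(responses, indent=2, sort_keys=True))
    return path


def compare_snapshots(before_file: Path, after_file: Path) -> list:
    """Return the ids of test cases whose responses differ between two snapshots."""
    before = json.loads(before_file.read_text())
    after = json.loads(after_file.read_text())
    # A case counts as failed if it is missing on one side or its response changed.
    return [case for case in sorted(before.keys() | after.keys())
            if before.get(case) != after.get(case)]
```

The whole point of the exercise is that both snapshots come from the same system; comparing a "before" from one environment with an "after" from another is exactly the trap I walked into.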
The only issue that worried me at that time was a really big performance difference I noticed between the Acceptance and Integration environments. "Acceptance" responded 7 times slower than "Integration", although "Acceptance", according to Engineering, was a 1:1 copy of the final system running on several nodes, while "Integration" only ran on 1 node. How can that be? When another tester reported that he couldn't confirm my discovery, I repeated my tests several times on both environments, each time confirming my previous conclusion. During this analysis I suddenly got another email from the sysadmin, telling me that the migration had now been completed on the final system as well. Great…! Now I couldn't do the "baseline grabbing" on the final system either. All I could do was compare the good-looking Acceptance Environment with the final system. The problem was that the two results were different, very different: 20 test cases failed. After some investigation I found the reason for 19 of the failed test cases.
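Measuring such a slowdown doesn't need anything fancy, by the way. A sketch of how one might time repeated probes and compare medians; the probe functions here are placeholders, not my actual test calls:

```python
import time
import statistics


def median_latency(probe, runs: int = 10) -> float:
    """Median wall-clock duration (in seconds) of calling probe() several times."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        probe()  # e.g. one request against the environment under test
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Hypothetical usage, assuming query_acceptance/query_integration wrap one
# request each against the respective environment:
# slowdown = median_latency(query_acceptance) / median_latency(query_integration)
# print(f"Acceptance is {slowdown:.1f}x slower than Integration")
```

Using the median over several runs keeps a single slow outlier from skewing the comparison, which matters when you repeat the measurement to convince a colleague.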
Those "negative" test cases each expected an empty response (because they got empty responses from all other test environments), but got valid data from production. A short analysis explained this: not all data had been copied from the final system to the test environments. Fine, but we were still left with 1 failed test, where the final system returned a slightly shorter response than all the other environments did. After I reported the results (by now I had the names of the involved people), the managers suddenly got nervous, and I mean really nervous. They wanted an exact explanation, which we couldn't provide in such a short timeframe.
The managers decided to stop the exercise and instructed Engineering to roll back. The team's suggestion to start the test from scratch, this time giving the testers a chance to collect the initial state of the system and then compare it with the data of the same system after migration, didn't get through anymore, although the whole procedure wouldn't have taken more than 30 minutes. In my opinion, this decision was wise, given that the communication and coordination of this test weren't optimal.
In the end, what confused me is this: I was instructed to do only 5 tests, I actually executed 300, and it ended up with one minor anomaly. What would the decision have been if I had executed only the other 299 tests, the ones that passed?
The other vexing part is that after I got the confirmation that the system had been rolled back, no one asked me to test the rolled-back system.
After spending so much energy on summarizing this exercise, I should of course spend a moment looking at myself. How could I have prevented this situation from ending the way it did? It probably would have helped if I had started collecting the initial state before the official test began. And, despite the fact that I wasn't involved in any planning/organization meetings/emails/telcos, I could still have published my testing approach to the people I assumed might be involved. If I had taken the chance to publish my approach to testing the migration/patch beforehand, it might have helped steer the test in a different direction.