Thursday, July 19, 2012

How a double-click downed the backend-system

I originally drew this cartoon in 2012. It has its roots in us testers suffering from the fact that our bug reports were sometimes not written well enough for the stakeholders to understand the importance of some bugs.

Early this year (2019), I had an interesting experience that this cartoon fits even better. This is the story:

A few weeks ago, my automated API-based test suite caused the system to go to hell in a handbasket. The database service crashed and had to be recovered manually each time someone visited a grid that loaded data from the backend. After some investigation together with a developer, we found the reason for this exceptional behavior, which caused all users to get a system non-accessible message.

When I say users, I mean internal developers and testers, because luckily we were still far away from going live. It turned out the system had persisted a duplicate UUID. This duplicate piece of data caused the system to crash whenever a query read the affected record. I corrected the corrupt entry in the database and the problem was solved. But how the heck did my test suite manage to introduce such a duplicate? I tried over and over again, but never did I manage to make my automated tests do the same thing again. As a consequence, the problem was considered low priority, based on the assumption that the probability of this scenario recurring was almost zero. In fact, over a long period of time, it never happened again, until slap bang - during a manual test - I double-clicked the OK button in the web page to persist a new object. After a response time of about half a minute, I got the same system non-accessible message. Jackpot!
With something as simple as a double-click from a dumb web client, we got the backend system to crash. The result was a denial of service, without my having to flood the system with superfluous requests.
With this new information, the previously low-rated issue all of a sudden got a different kind of attention. The probability that an end user double-clicks a UI element, even in a web client, is very high. The real reason for the duplication was that the web client created a new UUID in memory as soon as a user opened the web page for inserting objects. That means it created the UUID before the user actually submitted the request to the server. When the user finally clicked OK, that UUID was passed to the backend along with the other data entered by the user. When one double-clicked the button, the same dataset, including the same UUID, was submitted twice in sequence. The backend had no unique index to check for duplicate UUIDs.
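
To make the mechanism concrete, here is a minimal sketch of the guard that was missing. It is not our actual system: SQLite stands in for the real database, and the table `objects`, the index name, and the functions `create_schema`, `save_object`, and `simulate_double_click` are all made up for illustration. It only shows how a unique index on the UUID column turns the second request of a double-click into a rejected insert instead of a corrupt duplicate record.

```python
import sqlite3
import uuid


def create_schema(conn):
    # Hypothetical table; the real schema is not described in this post.
    conn.execute("CREATE TABLE objects (uuid TEXT NOT NULL, name TEXT NOT NULL)")
    # The guard the backend was missing: duplicate UUIDs are rejected.
    conn.execute("CREATE UNIQUE INDEX idx_objects_uuid ON objects (uuid)")


def save_object(conn, payload):
    """Persist one object; report a repeated UUID instead of storing it twice."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO objects (uuid, name) VALUES (?, ?)",
                (payload["uuid"], payload["name"]),
            )
        return "created"
    except sqlite3.IntegrityError:
        # Same UUID submitted again, e.g. by the second click of a double-click.
        return "duplicate"


def simulate_double_click():
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    # The client creates the UUID once, when the form is opened ...
    payload = {"uuid": str(uuid.uuid4()), "name": "new object"}
    # ... and the double-click sends the identical payload twice.
    results = [save_object(conn, payload), save_object(conn, payload)]
    count = conn.execute("SELECT COUNT(*) FROM objects").fetchone()[0]
    conn.close()
    return results, count


if __name__ == "__main__":
    print(simulate_double_click())
```

With the index in place, only one row is stored and the second request can be answered as a harmless repeat. Disabling the OK button after the first click would be a sensible addition on the client side, but the backend should not rely on it.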

The most appalling part of this short story is that none of us ever thought of the double-click as a potential scenario to reproduce when we originally detected the bug. There were several experts and architects involved in the analysis, and even though I had a great set of test patterns at hand (the double-click was on the list), I was unable to think outside the box promptly. Weeks later, I had a weird flash of inspiration and it took me only seconds to reproduce the issue. The good thing is, we are still not live, so there is still time to fix it.
Add-on, August 2023:
The cartoon fits perfectly with another defect that we detected more than a year ago. A grey-scale scanned document, when edited and rotated in program A, could no longer be viewed in program B, and both tools were used in parallel. But the anomaly didn't gain enough attention, as it looked like an edge case and customers never experienced any issues, until "slap bang" a customer could no longer export their documents because of edited grey-scale documents. What followed was weeks of investigation, experiments, and workarounds, all without a hunky-dory solution. So yes, that cartoon was like a precursor for the next ugly thing to happen.

ThanX to The Testing Planet magazine editors, who were so kind to publish my cartoon in their issue 8.