Dave, a good friend, sent me a WhatsApp message and made fun of a software that– when put under load - did not properly log invalid login attempts.
All he could see was “error:0” or “error:null” in the log-file.
This reminded me to a similar situation where we were hunting down the root causes of unhandled exceptions showing several times per day at customers out of the blue.
The log-files didn’t really help. We suspected a third party provider software component and of course the vendor answered that it cannot come from them.
We collected more evidence and after several days of investigation we could prove with hard facts that the cause was indeed the third party component. We turned off the suspected configuration and it all worked smoothly after that.
Before we identified the root cause, we were overwhelmed about the many exceptions found in the log-files. It was a real challenge to separate errors that made it to a user’s UI from those that remained silent in the background although something obviously went wrong with these errors, too. The architects decided the logs must be free from any noise by catching the exceptions and properly deal with it. Too many errors can hamper the analysis of understanding what’s really going wrong.
Two weeks later, the logfiles were free from errors. Soon after one of the developers came to me and moaned about a detection he just made. He stated someone simply assigned higher log-levels for most of the detected errors so these could not make it anymore into the logfiles. In his words, the errors were just swept under the carpet.
I cannot judge whether that was really true or whether it was done with just a few errors where it made sense, but it was funny enough to put a new sketch on paper.
No comments:
Post a Comment