Wednesday, October 2, 2024

The Birthday Paradox - and how AI helped resolve a bug

While working on and testing a new web-based solution to replace the current fat client, I was assigned an interesting customer problem report.

The customer claimed that since the deployment of the new application, they had noticed at least two incidents per day in which an electronic document was sent with a corrupt set of attachments. For example, instead of 5 medical reports, only 4 were included in the generated mailing. The local IT administrator observed that the incidents were caused by duplicate keys occasionally assigned to the attachments.

However, they could not tell whether the duplicates were created by the old fat client or the new web-based application. Both products were used in parallel: some departments used the new web-based product, while others still stuck to the old one. For compatibility reasons, the new application also inherited antiquated algorithms, so as not to mess up the database and to guarantee parallel operation of both products.

One of these relics was a four-character alphanumeric code for each document created in the database. The code had to be unique only within one person's file of documents. If another person had a document using the same code, that was still OK.

At first, it seemed very unlikely that a person's document could be assigned a duplicate code, and there was a debate between the customer and various stakeholders on our side.
The new web application was claimed to be free of duplicates, but I was not so sure about that. The customer ticket was left untouched and the customer lost out, until we found a moment, took the initiative, and developed a script to observe all new documents created during our automated tests and during manual regression testing of other features.

The script was executed once an hour. We never saw any duplicates until, after a week, the Jenkins job suddenly raised an alarm reporting the detection of a duplicate. That moment was like Christmas, and we were excited to analyze the two documents.

Both documents did indeed belong to the same person. Now we wanted to know who had created them and which scenario had been applied in this case.

Unfortunately, it was impossible to determine who had authored the documents. My test team claimed not to have done anything with the target person. The name of the person for whom the duplicates were created occurred only once in our test case management tool, but not in a scenario that could have explained the phenomenon. The user ID (author of the documents) belonged to the product owner. He assured us he had not done anything with that person that day and explained that many other stakeholders could have used the same user ID in the test environment where the anomaly was detected.

An appeal in the developer group chat did not help resolve the mystery either. The only theory on the table was "it must have happened during the creation or copying of a document". The easiest explanation would have been the latter: the copy procedure.

Our theory was that copying a document could result in the same code being assigned to the new instance. But we tested that: copying documents worked as expected. The copied instance received a new, unique code that was different from its origin. Too easy anyway.

Eager to resolve the mystery, we asked ChatGPT about the likelihood of duplicates occurring in our scenario. The response was juicy.

It claimed an almost 2% chance of a duplicate if the person already had 200 assigned codes (within his or her documents). That was really surprising, and when we asked ChatGPT further, it turned out the probability climbed to 25% if the person already had 2000 distinct codes in her documents.

This result is based on the so-called Birthday Paradox, which states that it takes only 23 random individuals to get a 50% chance of a shared birthday. Wow!
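For the sceptics among us, the numbers are easy to check. The following C# snippet is my reconstruction of the math, not the script we ran: it computes the exact shared-birthday probability for n people and applies the usual approximation p ≈ 1 - e^(-n(n-1)/(2N)) to a four-character alphanumeric code space of N = 36^4 codes.

using System;

class BirthdayParadoxCheck
{
    // Exact: P(shared birthday) = 1 - (365/365 * 364/365 * ... * (365-n+1)/365)
    static double SharedBirthdayProbability(int people)
    {
        double pAllDistinct = 1.0;
        for (int k = 0; k < people; k++)
            pAllDistinct *= (365.0 - k) / 365.0;
        return 1.0 - pAllDistinct;
    }

    // Approximation for a code space of size N: P(duplicate) ~ 1 - exp(-n(n-1) / (2N))
    static double DuplicateCodeProbability(long codes, long spaceSize) =>
        1.0 - Math.Exp(-(double)codes * (codes - 1) / (2.0 * spaceSize));

    static void Main()
    {
        Console.WriteLine($"23 people, shared birthday: {SharedBirthdayProbability(23):P1}"); // ~50.7%

        long codeSpace = 36L * 36 * 36 * 36; // 4 alphanumeric characters: 1,679,616 codes
        Console.WriteLine($"200 codes:  {DuplicateCodeProbability(200, codeSpace):P1}");
        Console.WriteLine($"2000 codes: {DuplicateCodeProbability(2000, codeSpace):P1}");
    }
}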

Since I am not a mathematician, I wanted to test the theory with my own experiment. I started to write down the birthdays of 23 random people from my circle of acquaintances. Actually, I could already stop at 18: within this small set, I had identified 4 people who shared the same birthday. Fascinating!

That egged us on to develop yet another script and flood one fresh example person record with hundreds of automatically created documents.

The result was revealing:


Number of assigned codes for 1 person       500    1000    1500    2000    2500
Number of identified duplicates (Run 1)       0       0       5       8      11
Number of identified duplicates (Run 2)       0       0       3       4       6

With these two test runs, we could prove that the new application produces duplicates, provided enough documents are already assigned to the person upfront.

The resolution could be as simple as this:

  • When assigning the code, check for existing codes and re-generate if needed (this could be time-consuming, depending on the number of existing documents); see the sketch after this list.
  • When generating the mailing, the system could check all selected attachments and either automatically correct the duplicates by re-generating them, or warn the user so the duplicates can be corrected manually.
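As an illustration of the first option (this is just a sketch of the idea, not the fix that was actually implemented), the code generation could simply retry until it finds a code that is not yet used within the person's documents:

using System;
using System.Collections.Generic;

static class DocumentCodeGenerator
{
    private const string Alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static readonly Random Rng = new Random();

    // existingCodes: all codes already assigned within this person's file of documents.
    public static string NextUniqueCode(ISet<string> existingCodes)
    {
        // The space has only 36^4 = 1,679,616 codes; give up after too many collisions.
        for (int attempt = 0; attempt < 2_000_000; attempt++)
        {
            var chars = new char[4];
            for (int i = 0; i < chars.Length; i++)
                chars[i] = Alphabet[Rng.Next(Alphabet.Length)];

            var code = new string(chars);
            if (existingCodes.Add(code)) // Add returns false if the code already exists
                return code;
        }
        throw new InvalidOperationException("No free four-character code left for this person.");
    }
}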

What followed was a nicely documented internal ticket with all our analysis work. The fix was given the highest priority and made it into the next hotfix.

When people ask me what I find so fascinating about software testing, this story is a perfect example. Sure, we often have to deal with boring regression testing or with repeatedly throwing pieces of code back to the developer because something was obviously wrong, but the really exciting moments for me are puzzles like this one: fuzzy ticket descriptions, false claims, obscure statements, contradictory or incomplete information, accusations, and nobody really finding the time to dig deeper into the "forensics".

That is the moment when I cannot resist and love to jump in.

But the most interesting finding in this story has not been revealed yet. While looking at the duplicates, we noticed that all of them ended with the character Q.


And when looking closer at the other, non-duplicated codes, we noticed that actually ALL codes ended with the character Q. This was even more exciting. It reduced the number of possibilities from 36^4 = 1,679,616 variants (roughly 1.6 million) down to only 36^3 = 46,656, which raised the probability of duplicates accordingly.

In the end, I could also prove that even the old legacy program was not 100% free of creating duplicates, although forcing it was close to impossible. The only way to make the old program create a duplicate was to flood one example person with 46,656 records, meaning every possible code was used up. But that is such an unrealistic scenario that it borders on the absurd.

Further notes for testers
Duplicate records can easily be identified in Excel: on the Data tab, go to the Data Tools group and choose "Remove Duplicates".

Another tool that was helpful in my case is the online duplicate finder at
https://www.zickty.com/duplicatefinder
where you can paste your full data set and it returns all duplicates.

Yet another alternative is to use an SQL GROUP BY query, as I did in my case:

SELECT COUNT(*) as num, mycode FROM table WHERE personid=xxx GROUP BY mycode ORDER BY num DESC;

And... my simple piece of C# code, which I used within the automated scripts to raise an alarm whenever the returned count was higher than 1.
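The original snippet was embedded as a screenshot, so here is a reconstruction of the idea rather than the exact code: run the GROUP BY query from above and raise an alarm as soon as any code occurs more than once for the given person (the table name is assumed).

using System;
using Microsoft.Data.SqlClient;

static class DuplicateCodeCheck
{
    public static void AlarmOnDuplicates(string connectionString, int personId)
    {
        const string sql =
            "SELECT COUNT(*) AS num, mycode FROM documents " + // table name assumed
            "WHERE personid = @personId GROUP BY mycode ORDER BY num DESC";

        using var con = new SqlConnection(connectionString);
        using var cmd = new SqlCommand(sql, con);
        cmd.Parameters.AddWithValue("@personId", personId);
        con.Open();

        using var reader = cmd.ExecuteReader();
        while (reader.Read())
        {
            int count = reader.GetInt32(0);
            if (count <= 1)
                break; // results are sorted descending, so no duplicates follow
            Console.Error.WriteLine($"ALARM: code {reader.GetString(1)} used {count} times for person {personId}");
        }
    }
}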


Thursday, July 11, 2024

Bug's survival strategy

We recently stumbled over a bug hiding in a place where I really had to sigh:

"Sorry, really?"

The customer complained that their documents could not be exported anymore; the system failed with a null pointer exception. Initially, we could not reproduce the problem in our internal test environment, until we injected a real production document from the customer and tried it with that. So far, okay. But we didn't stop there; we wanted to understand what caused the problem. First, we assumed the resolution of the scanned document might be different (we had had some issues in that direction in the past). We were wrong; that was not the problem. The resolution of the document was identical to what we already had internally.

Then we started to play with annotations, as there were some weird errors in the back-end log files pointing in that direction. We added a marker, a line, and a text annotation to the document, saved it, and tried again. Same issue. Then we did the same on the second page. When we added an annotation there and marked it as a burned-in annotation, the problem did NOT occur anymore for that imported document. That was an interesting observation.

To be really sure we had hit the jackpot, we removed the burned-in attribute and tried again. Indeed, it now threw the same exception again. Setting the burned-in attribute once more and re-checking... yes, now it worked again.

Then we did the same test with annotations on the first page. There, the behaviour was not the same. So it looked like it really mattered where the annotations were placed and with which attribute values.

What a wonderful finding and a great hint for our developers. It helped speed up the fix.

After this analysis, we asked ourselves whether testing can really be expected to cover all possible variations: checking whether annotations of every kind work on all sorts of pages, with different kinds of documents (colored, grey-scale, different DPIs, and so on), and all of this within a time frame that is far from fair.
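Just to illustrate the scale with made-up numbers: 3 annotation types × 2 burned-in states × 10 page positions × 3 color modes × 4 DPI settings already gives 3 · 2 · 10 · 3 · 4 = 720 combinations for this single export scenario, before even mixing in document sizes, languages, or interactions with other features.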

The fact is, no matter how hard and how accurately you test, software will never be fully free of bugs. The complexity of software has increased dramatically over the last few decades, and the number of test cases needed to reach even a minimum level of coverage has grown so much that it has become impossible to execute all tests and all possible variations within the time given.

 "Even a seemingly simple program can have hundreds or thousands of possible input and output combinations. Creating test cases for all these possibilities is impractical. Complete testing of a complex application would take too long and require too many human resources to be economically feasible". [1]

Software development today is still a process of trial and error. I have never seen a programmer develop a software component from scratch without several rounds of failed compiles, fixing, rewriting and retesting, finally shipping it, then failing at the customer's site and going through the same process again.

 "Software errors are blunders caused by our inability to fully understand the intricacies of these complex products" [2]

Ivars Peterson compares today's technology to black magic: engineers act like wizards who brew their magic potion by mixing ingredients of terminology that only they understand, while the public, dazzled by the many visible achievements of modern technology, often regards engineers as magicians who can solve any problem.

References

[1] The Art of Software Testing, Glenford J. Myers
[2] Fatal defect, Ivars Peterson

Tuesday, April 16, 2024

AI and a confused elevator

A colleague recently received a letter from the real estate agent stating that several people had reported a malfunction of the new elevator. The reason, as it turned out after an in-depth analysis: the doors had been blocked by people moving into the building while hauling furniture. This special malfunction detection was claimed to be part of the new elevator system, which is based on artificial intelligence.
The agent kindly asked the residents NOT to block the doors anymore, as this confuses the elevator and makes it likely to stop working again.

I was thinking..."really"?

I mean... if I hear about AI in elevators, the first thought that crosses my mind is smart traffic management [1]. For example, in our office building most employees go for lunch around noon and call the elevators. One way to do employees a great favor is for an elevator to return to the floor where the buttons are being pressed right after it has delivered the previous group of people. Or, if several elevators exist, to make sure they position themselves at different floors in the building so people never have to wait too long for one.

But I had never expected an elevator to be irritated and distracted for several days because someone temporarily blocked the doors. It is surprising to me that such an elevator has no built-in facility to reset itself automatically after a while. It is weird that a common use case like temporarily blocked doors wasn't even covered in the technical reference book and required a technician to come twice, because he or she could not resolve the problem the first time.

A few weeks later, I visited my friend in his new apartment and also wanted to see that mysterious elevator. D'oh! Another brand-new elevator that does not let you undo an incorrect choice after pressing the wrong button. But it does contain an integrated iPad showing today's latest news.
Pardon me? Who needs that in a four-floor building?

I often hear or read about software components whose marketing claims they use AI, whereas in reality the most obvious use cases were not even considered, like that undo button [2] which I will probably miss in elevators until the end of my days.

 References

[1] https://www.linkedin.com/pulse/elevators-artificial-intelligence-ascending-towards-safety-guillemi/

[2] https://simply-the-test.blogspot.com/2018/05/no-undo-in-elevator.html


Wednesday, November 29, 2023

Software made on Earth


This is a remake of my original cartoon, which was published by SD Times, N.Y., in their newsletter of April 1, 2008.

Sunday, November 5, 2023

Testing under the Hood

Even when a particular test passes at first glance, things might still be going wrong. You may just not have noticed, because the user interface stays quiet, at least for the moment. Things can go wrong in a black box long after you executed the test: hours, days, or even weeks later. The longer such problems remain undetected, the more effort it takes to fix them and repair the damage they caused, especially if the system is already live in production. See also "Cheerful Debugging Messages and its Consequences" in this blog.

It is not enough to look only at the front end of an application. You should also watch carefully what is going on behind the curtains. Give all testers a facility to query the underlying database. A lot of things can go wrong there and remain undetected for too long. It starts hurting only when such data is shared with or passed to other programs through an interface that reads or exchanges the data. I have seen a lot of data stored inappropriately, only to cause pain later when it was used by another program.

I developed an SQL query tool with some extra facilities, such as an analyser that compares all tables before and after a triggered action.
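The heart of that analyser is simple. A minimal sketch of the before/after comparison (assuming SQL Server and leaving out all the convenience features of the real tool): take one checksum per table before the action, take it again afterwards, and report every table whose checksum changed.

using System;
using System.Collections.Generic;
using Microsoft.Data.SqlClient;

class TableDiff
{
    // One checksum per user table; any change in a table's content changes its checksum
    // (with a small risk of collisions, which is acceptable for test analysis).
    static Dictionary<string, long> Snapshot(string connectionString)
    {
        var result = new Dictionary<string, long>();
        using var con = new SqlConnection(connectionString);
        con.Open();

        var tables = new List<string>();
        using (var cmd = new SqlCommand(
            "SELECT TABLE_SCHEMA, TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE'", con))
        using (var reader = cmd.ExecuteReader())
            while (reader.Read())
                tables.Add($"[{reader.GetString(0)}].[{reader.GetString(1)}]");

        foreach (var table in tables)
        {
            using var check = new SqlCommand($"SELECT COALESCE(CHECKSUM_AGG(CHECKSUM(*)), 0) FROM {table}", con);
            result[table] = Convert.ToInt64(check.ExecuteScalar());
        }
        return result;
    }

    static void Main()
    {
        const string cs = "Server=.;Database=MyTestDb;Integrated Security=true;TrustServerCertificate=true";
        var before = Snapshot(cs);
        Console.WriteLine("Now trigger the action in the application, then press Enter...");
        Console.ReadLine();
        var after = Snapshot(cs);

        foreach (var (table, checksum) in after)
            if (!before.TryGetValue(table, out var old) || old != checksum)
                Console.WriteLine($"Changed: {table}");
    }
}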

How can testers live without such tools? They open up a whole new universe of potential problems just waiting to be reported.

Wednesday, September 27, 2023

Revise the Test Report

I thought I had already published this cartoon in 2020, but I couldn't find it, so I am doing it now, with a three-year delay... =;O)
 


Tuesday, August 15, 2023

Mutation Testing and why we don't need it, or do we?

When our kids were still small, it was a tradition every Easter to hide chocolate eggs, sweets, and small presents in the garden, around the house, at the carport, and sometimes also inside the house.
While the kids were excited to find all the little things, we parents watched them, equally excited.

When Easter was long over, one or the other egg was often still found by accident in a corner or in a plant pot, too old to still be edible. In other words, our kids didn't track down all of them at Easter. Over the years, they got better and better, and we had to be more creative in finding new, extraordinary hiding places so that they didn't have an easy catch ("low-hanging fruit", as testers would say).

While we never gave a thought to our kids' "mathematical" effectiveness in finding all these little presents, this is exactly what mutation testing is about.

It is a method to measure how effective unit tests are at detecting anomalies in the code. The idea is to inject bugs on purpose and then verify how many of them are found. That's pretty much the same as hiding chocolate eggs in the garden.

A typical example of an injected bug (a mutant) is a comparison operator changed from (x < y) to (x > y), or a boolean flipped from its initial value true to false or vice versa. In the case of a calculation engine, the computed value could be fuzzed to return an incorrect result. The point is that these bugs are introduced on purpose and, in contrast to our annual tradition at Easter, the tool that modifies the code knows exactly how many mutants were added and where.
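A tiny, self-contained illustration (my own example, not taken from a particular tool): the mutant flips the comparison operator, and a test that covers the boundary "kills" it, because it passes against the original but would fail against the mutant.

using NUnit.Framework;

public static class AgeCheck
{
    // Original implementation.
    public static bool IsAdult(int age) => age >= 18;

    // Mutant: the comparison operator was changed on purpose (>= became <).
    public static bool IsAdultMutant(int age) => age < 18;
}

[TestFixture]
public class AgeCheckTests
{
    // Passes against IsAdult; if it were run against IsAdultMutant it would fail, killing the mutant.
    [TestCase(17, false)]
    [TestCase(18, true)]
    public void IsAdult_ReturnsExpectedResult(int age, bool expected)
    {
        Assert.That(AgeCheck.IsAdult(age), Is.EqualTo(expected));
    }
}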

When the unit tests are executed, the mutation testing tool checks whether the mutated code makes any additional tests fail. A mutant that causes at least one test to fail is "killed"; a mutant on which all tests still pass has survived, and surviving mutants are an indication of inadequate tests.

I am not experienced in automated mutation testing, but I find the topic quite interesting, especially because IT companies tend to measure just test coverage and often have no idea whether their unit tests are really effective. Test coverage alone doesn't tell you anything about the quality of the tests: you can have 100% coverage for a method and still fail miserably with uncaught exceptions when other valid inputs are applied to it.

Although mutation testing is usually done as part of automated testing using corresponding plug-ins, you can also do it manually. When I was drawing the cartoon, I was more focused on the manual aspect and less on the potential of using it to evaluate existing automated tests.

Let's take a few steps back and look at our current approach. We have a lot of manual tests (>1000), a lot of unit tests (>20,000), a very effective API test suite (>3000 tests), and a few UI tests (ca. 100), following the typical test automation pyramid in terms of distribution, but we haven't integrated any sort of mutation testing yet.

I am emailed automatically whenever our testers find defects, whether through manual testing, the automated UI test scripts, or the automated API tests. Based on the number of emails received daily, I conclude that we are an effective test team that finds many defects. But, of course, it would be more interesting to learn whether we could do even better. Are there even more bugs around to catch? Honestly, with the current number of anomalies reported by my testers, my first reaction was rather defensive: why should I inject additional bugs on purpose? We already have enough to do analyzing all the findings that slipped into the code unintentionally. This was also the original idea behind the cartoon, but... here is my mistake:

We have no facts at hand, just a certain number of defects that we raise every week.

Mutation testing could help us collect more facts. It cannot only be applied through tools; it can also be done manually. For example, if you want to understand how long it takes until a certain (obvious) bug introduced on purpose is found, just add it and see. You don't even need to inject code; you can also change a configuration so that it leads to different (unexpected) behavior.

For example, one piece of software I test creates documents with inquiries to doctors. A configuration allows the documents to be fitted with a data-matrix code on the pages the doctors have to fill out and return. When the letters come back with the data-matrix code on them, a software component can automatically identify the original request and the related patient and map the received answer to them. This enables quick access to both the original request letter and the response.

This configuration could be turned off (on purpose), causing the created letters to be sent out without a data-matrix code. How long do you think it would take until our testers noticed the missing data-matrix code on the letters?

I am pretty sure it wouldn't take long, because such a test is well documented in the regression test suite. But what if we challenged them more, say by making the letters print a hard-coded data-matrix code that is the same for all letters?

It takes more effort for a tester to find that problem.

If the test is not documented, the testers are likely to miss the bug. If it is documented, whether the test is executed at all may still depend on the priority set for the test case. And if testers are all too confident that this piece is unlikely to fail, they won't test it either.

If you inject such mutants, you need to be clear about your goal. Do you want to measure the efficiency of the testers, the accuracy of the test cases, or the effectiveness of the automated tests?

Saturday, May 13, 2023

License Expired

In an amusing short video from CNN [1], Alexei Navalny, a Russian opposition leader and anti-corruption activist, explained the meme MOSCOW4. It stands for the stupidity of Putin's command structure, which, according to Navalny, consists of an array of complete morons. He underlined the statement with the example of one of them whose email passwords were hacked several times in a row: the first password was "Moscow1", then "Moscow2", and so on.

After we ourselves managed several times to forget to update expiring license keys for our customers, I remembered this story. We are no better, and I thought it was about time to honour our repetitive mishap with a corresponding cartoon. For the dinosaurs, I experimented with a different kind of grey fill, something Gary Larson had in his cartoons, too.


[1] https://edition.cnn.com/videos/world/2022/04/19/navalny-moscow-4-origseriesfilms-3.cnn

Sunday, March 12, 2023

Lindy's Law in Test Automation

by T. J. Zelger, March 12, 2023

When I developed my first "robot" 20 years ago, with the goal of testing the software automatically so that my team didn't have to run the same manual tests every day, there were at most a handful of products enabling it. There were tools from well-known companies like IBM and Mercury (now HP), and they were extremely expensive. You didn't have much choice, and once you had decided on a tool, it was almost impossible to revert that decision and go for another one; doing so inevitably resulted in enormous extra costs.

A little later, a few interesting and cheaper alternatives emerged, such as RANOREX, an Austrian company that soon taught the big players to fear it, thanks to its quality and an attractive value for money.

We also experimented with other products that we used for specific tasks and later replaced with newer or better ones. Among these, I particularly remember a Canadian product that provided us with valuable services when testing a vehicle valuation calculation engine. As far as I know, the product no longer exists today, and my memories are patchy, but I believe it was a forerunner of one of the open-source systems that are widely used today, or something similar.

Twenty years later, you will find a flood of even cheaper or free offerings. A closer look reveals that most products are based on a few identical core modules. Selenium is currently one of the most popular "engines" on which tools are built. I also use Selenium, and because it really is just an engine, you have to develop additional methods and modules on top of it to turn it into a stable and easy-to-maintain test automation suite.
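A typical example of such an additional layer (assumed names, just a sketch of the kind of helper I mean) is a click that waits until the element is actually present, visible, and enabled instead of failing immediately:

using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;

public static class WebDriverHelpers
{
    public static void SafeClick(this IWebDriver driver, By locator, int timeoutSeconds = 10)
    {
        var wait = new WebDriverWait(driver, TimeSpan.FromSeconds(timeoutSeconds));
        wait.IgnoreExceptionTypes(typeof(NoSuchElementException), typeof(StaleElementReferenceException));

        // Keep polling until the element is present, visible, and enabled, then click it.
        var element = wait.Until(d =>
        {
            var e = d.FindElement(locator);
            return e.Displayed && e.Enabled ? e : null;
        });
        element.Click();
    }
}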

Nowadays, when building their own framework, most people stick to the page object model. However, we used to have a different approach: our test data and instructions (action words) were kept in Excel. The idea was to enable testers with no programming skills to write automated tests. At the beginning we even thought we could convince our business analysts to maintain their own tests.

The idea of keeping data and keywords out of the code was not new; it already existed back when I was still working with IBM Rational Robot. For example, the SAFS framework by Carl Nagle [1] was the first framework I learnt about that followed a similar approach; or take TestFrame, an implementation of Hans Buwalda's so-called "action words" [2].
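The principle is simple. A minimal sketch of the action-word idea (hard-coded rows instead of an Excel sheet so the example stays self-contained; not our actual framework): each row holds an action word plus its parameters, and a dispatcher maps the word to an implementation.

using System;
using System.Collections.Generic;

class ActionWordRunner
{
    // Action word -> implementation. In the real framework these would drive Selenium.
    static readonly Dictionary<string, Action<string[]>> Actions = new()
    {
        ["OpenPerson"]      = p => Console.WriteLine($"Opening person file {p[0]}"),
        ["CreateDocument"]  = p => Console.WriteLine($"Creating document of type {p[0]}"),
        ["CheckCodeUnique"] = p => Console.WriteLine($"Verifying code uniqueness for person {p[0]}")
    };

    static void Main()
    {
        // In the Excel-based framework, these rows would be read from a worksheet.
        var rows = new[]
        {
            new[] { "OpenPerson", "4711" },
            new[] { "CreateDocument", "MedicalReport" },
            new[] { "CheckCodeUnique", "4711" }
        };

        foreach (var row in rows)
        {
            if (!Actions.TryGetValue(row[0], out var action))
                throw new InvalidOperationException($"Unknown action word: {row[0]}");
            action(row[1..]); // remaining cells are the parameters
        }
    }
}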

Our Excel-based framework was quite a success at our headquarters in Switzerland and Germany, and our plan to have testers without programming skills write their own scripts worked out well. But maintaining the framework wasn't quite that easy: it needed an expert to maintain all the UI locators and the required extensions, which was sometimes a little too tricky for the non-techies. And we never managed to get the business analysts on board.

Later (in a different company), I used the same approach but realized that it doesn't have the same effect if you have testers WITH programming skills. Excel isn't seen as cool enough for writing automated tests, and if you sell such an approach to techies, they will raise their eyebrows.

So we removed the Excel part and integrated everything into NUnit. That was also easier to debug.

And now? A newcomer called Cypress has entered the market [3]. As I don't want to get stuck in sweet idleness, we are starting a new adventure to see what it has to offer. We are keeping our Selenium scripts, but they are going into maintenance mode for now.

But do we really have to follow every new fashion trend? Who guarantees that new tools and ideas are really better than what we already have in place?

Fortunately, in my position as QA manager, I can mostly set the goals in this area myself. When things are going well, you face the dilemma between "don't touch a running system" and making sure you are not missing something.

Today, we use Selenium/C# with NUnit for automated UI tests, triggered daily by Jenkins. And we have an automated test suite that fires requests at the interface level (below the UI), following the test automation pyramid approach [4].

The problem: everything has been working smoothly for years! Why should I spend time investigating alternatives?

I am thinking here of Lindy's Law [5]: if something has proven itself for a long time, there is a high probability that it will continue to prove itself in the future. In my case, this applies in particular to our automated API tests, which are based on a framework we developed ourselves with the aim of keeping the test scripts at the highest possible level of abstraction. Technical details such as authentication and the communication with the backend remain hidden. Also, instead of working with raw JSON input and output, we deal with deserialized business objects.
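To give an idea of what such a test reads like (hypothetical names and endpoint, not our actual framework), the scripts assert on deserialized business objects rather than on raw JSON:

using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;
using NUnit.Framework;

// Hypothetical business object returned by the API.
public record PersonDocument(string Code, string Title, int PersonId);

[TestFixture]
public class DocumentApiTests
{
    private static readonly HttpClient Client = new()
    {
        BaseAddress = new Uri("https://testserver.example/api/") // assumed test endpoint
    };

    [Test]
    public async Task CreatedDocument_ReceivesFourCharacterCode()
    {
        // Authentication and serialization are hidden in the real framework; here it is a plain call.
        var response = await Client.PostAsJsonAsync("persons/4711/documents", new { Title = "Medical report" });
        response.EnsureSuccessStatusCode();

        var document = await response.Content.ReadFromJsonAsync<PersonDocument>();

        Assert.That(document, Is.Not.Null);
        Assert.That(document!.Code, Has.Length.EqualTo(4), "document code should be 4 characters");
    }
}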

We have at least 3000 automated tests, and they have identified bugs that were not caught by the developers' unit tests. Simply put, these interface tests are a success story, and I don't spend a second thinking about replacing them with a standard product. Why should we scan the market for "better" stuff that maybe isn't?

Because examining alternatives does not necessarily lead to replacing a tried-and-tested system. It can become a source of interesting new ideas and extensions for the existing solution. Being open-minded also helps you recognize the limits of your own system and thus check whether the current version can last not only in the near but also in the more distant future, and/or whether it can be supplemented with one or another useful feature.

The only constraint I am dealing with in this regard is my available time. But that's another story.

References:

[1] SAFS, Carl Nagle

[2] "Action Words" by Hans Buwalda, Software Test Automation (Fewster/Graham), Addison-Wesley

[3] https://www.testautomatisierung.org/testautomatisierung-cypress-vs-selenium/

[4] Test Automation Pyramid, Fowler, https://martinfowler.com/articles/practical-test-pyramid.html

[5] Lindy's Law by Albert Goldman, 1964, https://www.sciencedirect.com/science/article/abs/pii/S0378437117305964 and "Das Magazin", Nr. 10, 10-11. März 2023