PinDown - an Automatic Debugger of Regression Failures

PinDown is an automatic debugger of regression failures. It performs an automated, validated identification of faulty code. In this context validated means that PinDown fixes the regression bugs locally and makes the failing test pass again in order to prove that it has identified the correct set of faulty code. As soon as a bug report has been validated it is sent to the engineer who committed the faulty code. PinDown's patented debug algorithm is unique. No other tool provides automatic, validated debug. 

The advantages of validation are accuracy, performance and the ability to handle advanced debug scenarios. Accuracy is important because engineers need to know that the bug reports can be trusted and are not just guesses. Bugs are fixed much faster if engineers can take the content of a bug report as fact rather than as a starting point for a debate. Validation is also the reason PinDown can handle advanced debug scenarios where multiple commits together affect a test failure: only validation can tell you which of all the probable causes, and combinations of commits, is the reason a test is failing. Validation is good for performance because it allows PinDown to use fast debug algorithms which are accurate most of the time, but not always (e.g. the binary search algorithm), and to rely on validation as the final check before sending out the bug report.

Fig 1. Automatic Debug Process

The starting point is the regression test results produced by the customer's own test scripts. PinDown handles any type of test result: constrained random or directed tests, single or multiple causes, test failures, build failures, intermittent failures and even metric tests. PinDown can debug essentially anything the customer defines as a failure.

The next step is to pinpoint when and where a regression bug occurred. This is done by first grouping failures according to failure messages and historical test results, and then selecting the fastest test in each group. Only the fastest test in each group is used to find when that test started to fail. This is done by running older snapshots of the revision control system in order to find when the problem started to occur. Typically 10 different snapshots are tested in parallel. The search algorithm is a parallel binary search which can cover a large number of commits with very little testing; e.g. if you have 1000 recent commits, PinDown only needs to test about 30 snapshots to find when the test started to fail.
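As an illustration of the idea (a minimal sketch, not PinDown's actual implementation), a parallel bisection over commit history probes several snapshots per round and keeps the sub-range between the last passing and first failing probe. The commit list and the test_snapshot callback below are hypothetical.

    # Minimal sketch of a parallel bisection over commit history.
    def find_first_bad(commits, test_snapshot, parallel=10):
        """commits[0] is known to pass, commits[-1] is known to fail."""
        lo, hi = 0, len(commits) - 1                # lo passes, hi fails
        while hi - lo > 1:
            step = max(1, (hi - lo) // (parallel + 1))
            probes = list(range(lo + step, hi, step))[:parallel]
            results = {i: test_snapshot(commits[i]) for i in probes}  # run concurrently in practice
            for i in probes:
                if results[i]:
                    lo = i                          # still passing: raise the lower bound
                else:
                    hi = i                          # failing: lower the upper bound
                    break
        return commits[hi]                          # first snapshot where the test fails

With roughly 1000 commits and 10 probes per round, about three rounds (around 30 snapshot builds) are enough to isolate the first failing snapshot, which is where the figure of 30 snapshots for 1000 commits comes from.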

The next step is to validate that the suspected commit really is the one and only bad commit that caused this test to fail. This is done by going back to exactly the same revision where the tests were run in the first place and then patching the code locally in such a way that the bad commit is undone. If the failing test now passes, PinDown knows it has found the right culprit and immediately issues a bug report to the person who made the bad commit that triggered the test failure.
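Conceptually, the validation step amounts to something like the following sketch (purely illustrative; PinDown's own patching mechanism is not shown, and run_test is a hypothetical callback that runs the previously failing test):

    # Illustrative only: undo a single suspect commit on the exact revision that
    # was originally tested, then re-run the previously failing test.
    import subprocess

    def validate_suspect(repo, tested_revision, suspect_commit, run_test):
        subprocess.run(["git", "-C", repo, "checkout", tested_revision], check=True)
        # Apply the inverse of the suspect commit to the working copy only
        subprocess.run(["git", "-C", repo, "revert", "--no-commit", suspect_commit],
                       check=True)
        passed = run_test()                                   # True if the test now passes
        subprocess.run(["git", "-C", repo, "reset", "--hard"], check=True)  # drop the local patch
        return passed                                         # True => validated culprit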


 Video 1. PinDown Demo

If the validation returns a false result, i.e. the patch still causes a test or build failure, then PinDown continues the debug process. The validation may fail because PinDown picked the wrong commit, or because two or more bad commits affect the test result. It could also be that patching away a commit was not possible because it was too tightly dependent on another commit, e.g. a change to a common API made in two different commits. Whatever the reason, PinDown continues to debug and does not send out a bug report until it has made the failing test pass again. Note that this is a converging problem: in the worst-case scenario PinDown would have to remove all commits introduced since the last time the test passed. For regression bugs, the question is how few commits PinDown can undo in order to make the test pass again. Normally it is 1 or 2 commits. It is very rare that 3 or more commits hang so tightly together that they cannot be isolated; since we launched PinDown on the market in 2010 we have never seen more than 5 commits hang together.
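This converging fallback can be pictured as a smallest-first search over sets of recent commits, along the lines of the hypothetical sketch below (passes_with_undone is an assumed helper that reverts the given commits locally and re-runs the failing test; it is not part of PinDown's documented interface):

    # Conceptual sketch only: try undoing one commit, then pairs, then triples,
    # until the failing test passes again.
    from itertools import combinations

    def minimal_bad_set(candidate_commits, passes_with_undone, max_size=5):
        for size in range(1, max_size + 1):
            for subset in combinations(candidate_commits, size):
                if passes_with_undone(subset):
                    return subset        # smallest set of commits whose removal makes the test pass
        return None                      # worst case: undo everything since the last passing run

In practice the answer is almost always a set of one or two commits, which keeps even this exhaustive-looking search small.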

Bug Report

The PinDown bug report connects each failure to a commit (or a combination of commits) in the revision control system. For small commits it even shows the exact lines that were changed.
Fig 2. PinDown Bug Report
Note that the bug report says “Validated: true”. This means that PinDown successfully made the test pass again by undoing only this commit and nothing else, so the person who made the commit (in this case “Daniel”) can be confident that his change is related to the test failure. This bug report is taken from this demo.

Advanced Debug Scenarios

The problem with debugging regression test failures is that there is not necessarily only one bad commit at any one time. Consequently it is not possible to say that the first commit that caused a test to fail is the reason why the test still fails on the latest revision. We measured how often this assumption holds and found that it is wrong in 41% of the cases, which makes it a terrible foundation for an automatic debug process. Who listens to a tool that is wrong in almost half of the cases? The reason PinDown does not get lost in these cases is validation: PinDown does not give up until it has made the test pass again on the latest revision. Instead of trying to find the perhaps historically interesting answer as to when things started to go wrong, PinDown answers the more relevant question: what is wrong now?

Fig 3. Advanced Debug Scenarios

Overlapping Failures

There are many different debug scenarios, all of which are very common. The most common is overlapping failures, where two bad commits both make a test fail. In one such scenario the first bad commit has been fixed, but only after a second bad commit was introduced (“Overlapping Failures 1”). In this scenario it would be completely wrong to point out the first bad commit as the reason why the test is still failing.

In another scenario neither of the commits has been fixed (“Overlapping Failures 2”), which means that both need to be fixed in order to make the test pass. In this scenario it is not sufficient to just send a bug report to the person who committed the oldest bad commit, as this person cannot make the test pass again by addressing just one of the two bad commits. Both committers need to be notified and given the full picture in order to solve the problem.

Random Instability

Constrained random testing introduces another challenge: random instability. When trying to reproduce a test failure the engineer must use the same seed number in order to provoke the same scenario. Random instability occurs when the same seed number does not produce the same scenario, because another version of the test bench is being used which is too different from the version where the test initially failed. The more changes that have been made (i.e. the more commits), the higher the probability that random instability will occur. This may happen when a user tries to reproduce a test failure from a bug report by simply checking out the latest version from the revision control system. PinDown may also suffer from this when it goes back in time to older versions in order to find out when a test started to fail.
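A toy example (not taken from any particular test bench) shows why the same seed can yield a different scenario once the code that consumes the random numbers changes:

    # One extra randomized decision, added by a commit, shifts every later value
    # drawn from the same seeded generator.
    import random

    def stimulus_v1(seed):
        rng = random.Random(seed)
        return [rng.randint(0, 255) for _ in range(5)]

    def stimulus_v2(seed):
        rng = random.Random(seed)
        rng.randint(0, 1)            # new randomized decision introduced by a commit
        return [rng.randint(0, 255) for _ in range(5)]

    print(stimulus_v1(42))           # original failing stimulus
    print(stimulus_v2(42))           # same seed, different stimulus: the failure may not reproduce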

PinDown has a double-check mechanism to filter out most random instability issues. To start with, it uses the same seed number as in the reported test failure and backtracks in history until it finds the first time this test started to fail. The second check is to go back to the exact version where the test failure was reported and try to make the test pass by only undoing the suspected bad commit. This eliminates the risk of random instability caused by any other commit; the risk is reduced to the commit (or commits) reported in the bug report. The bug report contains information about the files and lines that have changed, so the engineer can assess the risk that these changes caused a random instability issue rather than an actual bug. For example, if the changes are not in the test bench then the risk is zero. The risk of random instability is the same as when an engineer fixes a bug locally and runs a local test, which passes, before committing the fix: how does the engineer know that the fix actually fixed the problem and did not just cause some random instability which made the test pass simply because it was not reproducing the same scenario? In both cases this is something the engineer can assess depending on what was actually changed, and in both cases the risk of random instability is low and contained to the change made by the user.

Intermittent Failures

PinDown can handle intermittent failures. It has a built-in mechanism called the network integrity check which can handle temporary or incomplete failures and take actions such as deleting a checkout area, re-running a test to get a proper test result, or ignoring incoherent results. It is also possible for the user to extend this mechanism with a tailor-made response, which can be to run any command, script or program. The following demo shows some aspects of the network integrity check.

 Video 2. Network Integrity
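As a purely hypothetical illustration of the kind of tailor-made response a user might plug in (the actual PinDown extension interface is not shown here), a user-supplied script could wipe a corrupt checkout area and re-run the test:

    # Hypothetical user-side response to a suspected infrastructure failure:
    # delete the checkout area, re-create it and re-run the test once.
    import shutil, subprocess

    def rerun_after_cleanup(checkout_dir, repo_url, test_cmd):
        shutil.rmtree(checkout_dir, ignore_errors=True)            # remove the corrupt checkout
        subprocess.run(["git", "clone", repo_url, checkout_dir], check=True)
        result = subprocess.run(test_cmd, cwd=checkout_dir)
        return result.returncode == 0                              # report the proper test result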

Build Failures

PinDown can also handle historical build failures, which is very important for an automatic debug tool. The history in the revision control system may contain build failures, i.e. historical snapshots where the code does not even compile. Even if these historical build failures have since been fixed, there is a chance that a test failure was actually introduced during the period when the build did not compile. What you really want is to fix the build failure for the snapshot you are interested in, so that you can run the test on that snapshot, see whether it passes or fails, and narrow down the bad commit. This is exactly what PinDown does. It fixes the build failure and then operates as normal with the parallel binary search algorithm to narrow down where the problem started. On top of this, PinDown validates the bad commit on the revision where the test was initially reported as a failure (and where there was no compilation issue; otherwise it would have been reported as a compilation failure, which PinDown can also handle). The end result is as reliable as always: PinDown will make the failing test pass again by removing only the bad commit or commits.