With large electronic development projects, a disproportionate amount of time is spent troubleshooting bugs in “finished” devices.
In the worst cases, projects are wrapped up and products sold to customers even though they still have problems. Look at the recent recall of 2018 Chevy Malibu sedans. Many of these vehicles contained a software bug in their engine control module (ECM) that disabled fuel injectors and prevented the car from starting. This has led General Motors to recall over 100,000 vehicles to date.
Successfully troubleshooting these problems prior to product launch is in everybody’s best interest. In fact, it can be the most rewarding part of a project for an engineer. (There’s the pride from a sense of accomplishment and the relief from not having to worry about customers who are dissatisfied or, even worse, in danger.)
A Systematic Approach
Electronic circuits are complex interactions of hardware and software that expose only limited information to the designer. Even the “simplest” electronic device may have hundreds of nodes, which can generate many different sequences of events. Typically, only one of these sequences is desired: the one in which the design successfully monitors all inputs and controls all outputs.
Engineers should expect that a complex design will not behave as desired when it first arrives from the manufacturer. The good news is that there’s an established method to apply deductive reasoning to identify root causes and fix these problems. The bad news is that it can take five minutes or five months. Its application requires a systematic approach and a tremendous amount of patience, creativity, and focus.
Step 1: Reproduce the Failure Symptom
The first step of the process is to reproduce the failure symptom. For intermittent failures, this can be the most time-consuming step. I once had a design failure that took days of plugging and unplugging a cable before the symptom recurred! A symptom that can’t be reproduced can’t be confidently fixed. Here are some techniques for reproducing intermittent problems:
- Gather as much information as possible from the failure’s witnesses.
- Apply environmental stresses such as thermal, mechanical, and electrical shock.
- Review data logs to identify failure sequences.
- Simulate the circuit and test the effects of hypothetical open and short circuits at different locations and sequences.
Schedule pressure often leads designers to make changes that address a symptom they can’t recreate. Unfortunately, this wastes time and effort and instills a false sense of confidence, since there’s no way to verify the effect of the change. An engineer should never implement design changes to address a failure before it can be reproduced.
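One of the techniques listed above, reviewing data logs to identify failure sequences, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the log format (timestamp, event name), the event names, and the function `events_before_failures` are all hypothetical, and a real device log would need its own parser.

```python
from collections import Counter

def events_before_failures(log, failure_event, window_s):
    """Count, per event, how many failures it preceded within window_s seconds.

    log is a list of (timestamp_seconds, event_name) tuples (assumed format).
    """
    counts = Counter()
    failure_times = [t for t, event in log if event == failure_event]
    for f_time in failure_times:
        # A set ensures an event repeated before one failure is counted once.
        preceding = {
            event
            for t, event in log
            if f_time - window_s <= t < f_time and event != failure_event
        }
        counts.update(preceding)
    return counts, len(failure_times)

# Hypothetical log from a device that resets intermittently.
log = [
    (0.0, "power_on"),
    (1.2, "cable_plugged"),
    (1.5, "usb_enumerated"),
    (4.0, "cable_unplugged"),
    (9.0, "cable_plugged"),
    (9.1, "brownout_warning"),
    (9.3, "reset"),
    (15.0, "cable_plugged"),
    (15.4, "usb_enumerated"),
    (20.0, "cable_unplugged"),
    (26.0, "cable_plugged"),
    (26.1, "brownout_warning"),
    (26.2, "reset"),
]

counts, n_failures = events_before_failures(log, "reset", window_s=1.0)
for event, n in counts.items():
    print(f"{event}: preceded {n} of {n_failures} failures")
```

An event that precedes every failure is only a lead for where to apply stress when trying to reproduce the symptom. It is a correlation, not yet a root cause.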
Step 2: Simplification
Once a failure has been reproduced, the second step in the troubleshooting process is simplification. This is done by removing functional blocks, one at a time, while monitoring behavior to see if the problem has been fixed.
Typically, the blocks are removed sequentially from the output back to the input. This step requires creativity to determine how to remove a function and still monitor the circuit’s behavior. One of the most challenging parts of the process is disconnecting feedback loops, because their automatic control mechanisms may mask the root cause(s) of the problem.
In these cases, the feedback loops should be replaced by external sources that mimic their outputs without automatic adjustments (make some room for extra power supplies!). This simplification process should continue until the failure symptom has been eliminated.
Step 3: Reintroduction
After the symptom has been eliminated, the next step is to reintroduce the functional blocks to the circuit. Blocks should be added one at a time while the circuit’s output is monitored to see when the problem recurs. When the failure recurs, the block that was just added must contain at least one of the failure’s root causes and should be removed again. Continue reintroducing the remaining blocks to separate those that are okay from those that contain root causes of the failure.
These two steps (simplification and reintroduction) should be repeated within the problematic blocks to further narrow down the root cause. At the end of this sequence, there should be one or more components identified as root causes of the failure symptom.
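The simplification and reintroduction steps can be sketched as a short search procedure. The sketch below is a simplified model, not a bench procedure: the block names are hypothetical, `symptom_present` stands in for an actual measurement on the bench, the simulated faults are assumed to act independently, and reintroducing blocks from the input side first is an assumption rather than a rule.

```python
def isolate_suspect_blocks(blocks, symptom_present):
    """Return the blocks whose reintroduction brings the symptom back.

    blocks is ordered from input to output.
    symptom_present(active) takes a frozenset of active block names and
    returns True if the failure symptom is observed (a bench measurement
    in practice; simulated here).
    """
    active = list(blocks)
    removed = []

    # Step 2: remove blocks from the output back toward the input
    # until the symptom disappears.
    while active and symptom_present(frozenset(active)):
        removed.append(active.pop())

    # Step 3: reintroduce the removed blocks one at a time.
    suspects = []
    for block in reversed(removed):
        active.append(block)
        if symptom_present(frozenset(active)):
            suspects.append(block)   # contains at least one root cause
            active.remove(block)     # take it back out and keep going
    return suspects

# Simulated circuit with two independent faults (hypothetical names).
FAULTY = {"regulator", "motor_driver"}

def simulated_symptom(active):
    return bool(active & FAULTY)

blocks = ["input_filter", "regulator", "mcu", "motor_driver", "output_stage"]
suspects = isolate_suspect_blocks(blocks, simulated_symptom)
print(suspects)
```

In this simulation the procedure flags both faulty blocks even though either one alone would cause the symptom, which is why the reintroduction continues after the first suspect is found. The same function could then be applied to the sub-blocks inside each suspect.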
Step 4: Fix Each Root Cause
Once all of the root causes have been identified, the next step is to fix each one. The solution depends on the cause, but it generally requires correcting a design mistake or making a component more robust.
Step 5: Verification
The final step of the troubleshooting process is verification. First, the fixed circuit should behave correctly under the conditions that previously caused the failure. Then, each design fix should be reverted to its original state while the circuit is monitored, to verify that the original symptom recurs. The purpose of this step is to confirm that only the minimum set of design fixes has been identified, which limits the impact and cost of implementing them.
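The verification logic can be written down as a simple check. This is a minimal sketch under assumptions: the fix names are invented for illustration, and `symptom_present` again stands in for a real test run under the conditions that originally caused the failure.

```python
def verify_fixes(fixes, symptom_present):
    """Split fixes into those that are needed and those that are not.

    symptom_present(applied) takes a frozenset of applied fix names and
    returns True if the original symptom is observed.
    """
    all_applied = frozenset(fixes)

    # First check: with every fix applied, the symptom must be gone.
    if symptom_present(all_applied):
        raise ValueError("symptom still present with every fix applied")

    # Second check: revert each fix on its own and look for the symptom.
    necessary, unnecessary = [], []
    for fix in fixes:
        if symptom_present(all_applied - {fix}):
            necessary.append(fix)     # symptom recurs, so the fix is needed
        else:
            unnecessary.append(fix)   # no recurrence, so the fix can be dropped
    return necessary, unnecessary

# Simulated design in which two of three candidate fixes are required.
REQUIRED = {"larger_bulk_cap", "debounce_firmware"}

def simulated_symptom(applied):
    return not REQUIRED <= applied

fixes = ["larger_bulk_cap", "ferrite_bead", "debounce_firmware"]
necessary, unnecessary = verify_fixes(fixes, simulated_symptom)
print("keep:", necessary)
print("drop:", unnecessary)
```

A fix that can be reverted without the symptom returning adds cost without contributing to the solution, so it is a candidate for removal from the final design change.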
This troubleshooting approach is well-established, but it’s often ignored due to time pressure or overconfidence. While skipping these steps may sometimes turn out all right for simple problems, shortcuts are counterproductive for complex problems with multiple root causes.
Kevin Murphy is Senior Principal Electrical Engineer at Bresslergroup.