SYSTEM RELIABILITY IMPROVEMENT
Most anecdotal power reliability problems customers see can be traced back to weaknesses in system-level reliability (the component application and system qualification) rather than the fundamental MTBF of the components themselves. For example:
The production version of the card draws more peak current than expected, causing the voltage to drop under extreme conditions.
The power system shuts down unexpectedly in the field (nuisance trips).
The card fails at the customer site, but when it is returned for repair, no faults are found (NFF).
Sequencing between rails depends on component tolerances and doesn't always meet the needs of the ICs.
Sequencing during shutdown wasn't considered during the design.
The power system cannot deliver full load at extremes of input voltage and temperature.
Power modules overheat due to restricted airflow when the card is installed in the equipment.
Although these types of problems can occasionally occur even in a well-designed system, the likelihood can be reduced significantly through careful design and thorough qualification testing. The table takes a closer look at these specific problems and offers tips on how they can be avoided.
Good power-system design is a complex, multifaceted subject that touches on the entire product and its environment. Don't underestimate the task's complexity. Furthermore, although the initial focus is on efficient power conversion, remember that the power-management functions are equally important in achieving good power-system performance.
MTBF IMPROVEMENT
Three fundamental methods can improve the MTBF of any system: use fewer components, make the components more reliable, and make the system function even if components fail. Each can play a part in improving power-system reliability, together with comprehensive qualification testing.
FEWER COMPONENTS
Often, component count can be reduced in the power-management system. A dedicated power-management IC can replace a large number of discrete components used for monitoring and control, such as comparators, op amps, optocouplers, and RC time delays. At the same time, a power-management IC can offer much better performance than a discrete solution, improving system reliability by accurately reporting marginal performance while avoiding nuisance trips.
For example, the Potentia PS-2610 measures each output rail voltage every 40 µs using an 8-bit analog-to-digital converter. The PS-2610 employs digital filtering to allow for fast response to a real OV condition while preventing false OV or UV shutdown due to voltage spikes.
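The filtering idea can be sketched as a simple consecutive-sample check. This is an illustration only, not the PS-2610's actual algorithm: the names `OvFilter` and `first_trip_index`, the threshold, and the filter depth of four samples are all assumptions made for the example.

```python
class OvFilter:
    """Confirm an overvoltage (OV) fault only after several consecutive
    samples exceed the threshold, so that a brief spike is ignored."""

    def __init__(self, threshold_v, samples_required=4):
        self.threshold_v = threshold_v            # hypothetical OV limit, volts
        self.samples_required = samples_required  # hypothetical filter depth
        self.count = 0

    def update(self, sample_v):
        """Feed one ADC sample; return True once the fault is confirmed."""
        if sample_v > self.threshold_v:
            self.count += 1
        else:
            self.count = 0  # any in-range sample resets the filter
        return self.count >= self.samples_required


def first_trip_index(samples, threshold_v, samples_required=4):
    """Return the index of the sample that confirms a fault, or None."""
    ov_filter = OvFilter(threshold_v, samples_required)
    for index, sample_v in enumerate(samples):
        if ov_filter.update(sample_v):
            return index
    return None


# A single-sample spike on a 3.3-V rail does not trip the filter...
spike = [3.3, 3.3, 3.9, 3.3, 3.3]
# ...but a sustained overvoltage does.
sustained = [3.3, 3.9, 3.9, 3.9, 3.9, 3.9]
```

With a 40-µs sample period and a depth of four samples, a filter like this would confirm a real OV condition within about 160 µs, while a spike shorter than the filter depth is ignored. A deeper filter rejects longer spikes at the cost of slower response.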
A typical POL contains fewer internal components than an isolated brick, and the failure rate can be significantly lower. The manufacturer's quoted failure rate for a typical POL is about 200 FITs (failures per billion device-hours, equivalent to an MTBF of 5 million hours), whereas the rate for a typical brick is about 500 FITs (an MTBF of 2 million hours). On the other hand, a POL usually has lower output power than a brick, so you may need more of them to meet your total power requirement. Of course, reliability is only one of many factors when choosing power converters. But by considering reliability early in the design, you can make the best tradeoff for your application.
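The arithmetic behind these figures is easy to check. The sketch below converts FITs to MTBF and adds failure rates for units that must all work, using the simple series model in which any one failure takes the system down. The function names and the three-POL comparison are assumptions for illustration, not figures from a real design.

```python
HOURS_PER_BILLION = 1e9  # 1 FIT = 1 failure per 10^9 device-hours


def fit_to_mtbf_hours(fit):
    """Convert a failure rate in FITs to an MTBF in hours."""
    return HOURS_PER_BILLION / fit


def series_fit(fits):
    """Total failure rate when every unit is needed (series model):
    the individual failure rates simply add."""
    return sum(fits)


POL_FIT = 200    # typical POL figure quoted in the text
BRICK_FIT = 500  # typical brick figure quoted in the text

pol_mtbf = fit_to_mtbf_hours(POL_FIT)      # 5 million hours
brick_mtbf = fit_to_mtbf_hours(BRICK_FIT)  # 2 million hours

# Hypothetical case: three POLs are needed to match one brick's output power.
three_pols_fit = series_fit([POL_FIT] * 3)  # 600 FITs
```

In this hypothetical case, the three POLs together come to 600 FITs, slightly worse than the single 500-FIT brick, which is why converter count matters as much as the per-unit failure rate.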
MORE RELIABLE COMPONENTS
Component reliability is influenced primarily by the qualification and quality-control processes used in manufacturing, as well as by the stresses applied in the application. Power-conversion reliability can be improved with a modular approach, using standard off-the-shelf dc-dc converters as components in your design. These units, which are built in high volume using an automated process with full quality control, offer excellent performance and reliability. You will avoid the need to calculate component stresses within the power converter, because the design is optimized during the manufacturer's in-house qualification.
Similarly, plan your power-management design around a dedicated power-management IC rather than a general-purpose device, like a gate array or microcontroller (MCU). A power-management design using an MCU or gate array requires extensive testing under both normal operation and fault conditions. This is to ensure that logical errors in programming don't cause incorrect behavior. Conversely, the dedicated power-management device's behavior is already fully tested and qualified by the manufacturer. Only the operating parameters (voltage levels, time delays) require programming.
FAULT TOLERANCE
To dramatically improve system reliability, design the system to be fault-tolerant. In the ideal case, an available backup instantly takes over for any component failure, leaving system performance unaffected. The term availability expresses the proportion of time for which the system performs as expected. The provision of backup components is called redundancy. In a practical system, there are limits to the degree of redundancy that can be achieved, and availability can never reach 100%. Through careful design, redundancy can provide almost complete protection against any single fault, and it can achieve 99.999% (five nines) availability or better.
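Availability can be estimated with a few lines of arithmetic. The sketch below uses the standard steady-state formula, availability = MTBF / (MTBF + MTTR), and models a redundant pair as failing only when both units are down at once. It assumes independent failures and perfect failover, which a real system only approximates; the MTBF and repair-time values are hypothetical.

```python
HOURS_PER_YEAR = 8760  # 365 days


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of a single repairable unit."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_pair(unit_availability):
    """Availability of two independent units when either one suffices."""
    return 1.0 - (1.0 - unit_availability) ** 2


def downtime_minutes_per_year(system_availability):
    """Expected downtime implied by a given availability."""
    return (1.0 - system_availability) * HOURS_PER_YEAR * 60


# Hypothetical unit: 100,000-hour MTBF, 24 hours to detect and replace.
single = availability(100_000, 24)
pair = redundant_pair(single)

# Five nines allows a little over five minutes of downtime per year.
five_nines_downtime = downtime_minutes_per_year(0.99999)
```

With these assumed numbers, a single unit is available about 99.976% of the time, while the redundant pair exceeds 99.9999%. The result depends strongly on the repair time, which is why prompt failure reporting matters.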
Most redundant systems achieve redundancy by duplicating entire cards. For example, two identical control-processor cards can be used in a shelf, either of which can take control if the other fails. The 48-V distribution system also is duplicated, with dual 48-V feeds to each card from independent circuit breakers. If any individual circuit breaker trips, the cards still receive uninterrupted power through the second feed. In most cases, it's not considered beneficial to duplicate the on-card power system itself, since any card failure (power or otherwise) means simply replacing the card.
For effective redundancy, it's vital to report all component failures immediately to the operator for maintenance before the backup fails. In the power system, this implies not only comprehensive monitoring of all output-voltage rails, but also monitoring of fuses and power feeds to detect any loss of redundancy. Additional monitoring such as input-current measurement and thermal sensing can provide advance warning of overload conditions and further improve reliability.
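The distinction between "still powered" and "still redundant" can be made explicit in the monitoring logic. The sketch below is a minimal illustration; the function name, the status strings, and the representation of each feed as a (feed present, fuse intact) pair are assumptions, not part of any particular device.

```python
def check_redundancy(feeds):
    """Classify the state of a card's redundant power feeds.

    feeds maps a feed name to a (feed_present, fuse_intact) pair.
    Returns a (status, failed_feed_names) tuple.
    """
    healthy = [name for name, (present, fuse_ok) in feeds.items()
               if present and fuse_ok]
    failed = [name for name in feeds if name not in healthy]

    if not healthy:
        status = "POWER_LOST"       # no usable feed remains
    elif failed:
        status = "REDUNDANCY_LOST"  # card still runs, but has no backup
    else:
        status = "OK"
    return status, failed


# Feed B's fuse has blown: the card keeps running on feed A,
# but the operator should be alerted before feed A also fails.
status, failed = check_redundancy({"A": (True, True), "B": (True, False)})
```

The key point is the middle case: the card shows no functional symptom, so without explicit monitoring the loss of the backup would go unnoticed until the second failure.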
While today's power systems are more complex, high reliability is achievable. Minimizing component count can improve the failure rate and yield a high calculated MTBF. Also, with effective power management, you can implement features that improve overall equipment reliability. Remember that reliability is much more than just MTBF. Carry out thorough qualification testing of your power system to ensure it meets equipment requirements under all conditions.