[Engineering Feature]
We Have Seen The Enemy, And The Enemy Is Heat
Today’s complex SoCs are prone to thermal issues that can cause field failures. Here’s how thermal analysis can help you ferret out those hotspots.
WHY THERMAL ANALYSIS? Thermal effects, like most physical effects that plague deep-submicron processes, can be dealt with through guardbanding. But as savvy designers know, excessive guardbanding, i.e., designing for worst-case process, voltage, and temperature corners, will leave performance on the table. It also usually means decoupling the voltage, temperature, process variation, and timing analyses from each other.
But the room for margin continues to shrink anyway. According to Li-Pen Yuan, group R&D director for extraction and power-integrity tools at Synopsys, there are two major problems with the margin approach: "Even if we anticipate elevated temperatures and define a larger range to which we must design, there's no guarantee that the actual chip temperatures will be within that new margin. So we run the risk of violating it."
The second problem is that margining too aggressively costs performance. "Thermal analysis should be used to understand the realistic distribution of temperatures on the chip in its various modes of operation. That way, we can potentially reduce the margin and not suffer from excessive guardbanding," says Yuan.
A typical thermal-analysis flow must examine three key elements. One involves the sources of heat and how to model them. Another is how that heat is distributed and/or dissipated. The third is determining when thermal equilibrium is reached.
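To make those three elements concrete, here is a minimal sketch rather than any vendor's algorithm. It assumes the die is cut into a coarse grid of cells, each with a power source, a lateral thermal conductance to its neighbors, and a vertical conductance to ambient through the package. The function name, the conductance values, and the grid itself are all hypothetical placeholders, not foundry data.

```python
def solve_steady_state(power_w, g_lateral=0.5, g_vertical=0.1,
                       t_ambient=25.0, tol=1e-6, max_iters=100_000):
    """Relax a grid of cell temperatures (deg C) until thermal equilibrium.

    power_w    : 2D list of per-cell heat sources in watts (the source model)
    g_lateral  : assumed conductance between adjacent cells, W/degC
    g_vertical : assumed conductance from each cell to ambient, W/degC
    Returns (temperature grid, number of sweeps needed).
    """
    rows, cols = len(power_w), len(power_w[0])
    temp = [[t_ambient] * cols for _ in range(rows)]
    for sweep in range(1, max_iters + 1):
        max_delta = 0.0
        for r in range(rows):
            for c in range(cols):
                neighbors = [
                    temp[nr][nc]
                    for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= nr < rows and 0 <= nc < cols
                ]
                # Energy balance per cell:
                # P + g_lat * sum(Tn - T) + g_vert * (Tamb - T) = 0
                g_total = g_lateral * len(neighbors) + g_vertical
                new_t = (power_w[r][c]
                         + g_lateral * sum(neighbors)
                         + g_vertical * t_ambient) / g_total
                max_delta = max(max_delta, abs(new_t - temp[r][c]))
                temp[r][c] = new_t  # Gauss-Seidel: reuse updated values at once
        if max_delta < tol:
            return temp, sweep  # equilibrium: no cell moved more than tol
    raise RuntimeError("temperature grid did not converge")
```

Feeding this a power map with a single active block shows heat spreading outward from that block, with the hottest cell sitting on the source. Production tools replace the uniform conductances with values extracted from the layout and the package model.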
Thermal models for the chip are created using the source model and the distribution network. The models draw upon technology parameters supplied by the silicon foundry. The distribution network is modeled using extraction in thermal tools such as Apache Design Solutions' Sahara-PTE thermal-analysis engine.
Those thermal models are supplemented by the boundary conditions describing their larger environment. According to Dian Yang, VP of product management and GM at Apache Design Solutions, the boundary conditions consist of models for the chip's package as well as for the board the package is attached to. Apache works with Ansys, whose IcePak and IceBoard tools generate those models and also interface with Cadence's Allegro Package Designer as well as Sigrity's Unified Package Designer.
Ansys' package-analysis tools are grid-based, finite-difference solvers that bring in geometric package shapes from either the electrical or mechanical CAD tools. They account for all of the physical elements, including the die, substrate, lead frame, die attach, and encapsulant. The tools automatically build a meshed, finite-element model that breaks the package into tiny fragments. This facilitates predictions of temperatures at anywhere from 30,000 to 75,000 distinct locations within the package.
Once models for the chip, package, and board are in hand, thermal-analysis tools such as Apache's Sahara-PTE, Gradient Design Automation's FireBolt, or ArchPro Design Automation's MVSIM can take on the task of identifying thermal hotspots.
Designers can use these tools to determine whether they're meeting their maximum junction-temperature (TJMAX) specifications. Other tasks performed by thermal-analysis tools include verification of thermal gradients for various modes of IC operation and identification of the best locations for thermal sensors.
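As a rough illustration of those checks, the sketch below assumes the analysis has already produced one temperature map per operating mode, expressed as a small grid of cell temperatures in degrees Celsius. The function name, the report fields, and the sample numbers are hypothetical and do not reflect any tool's output format.

```python
def analyze_modes(mode_maps, tj_max_c):
    """Summarize each mode's temperature map against a TJMAX limit.

    mode_maps : dict mapping a mode name to a 2D list of temperatures (deg C)
    tj_max_c  : maximum allowed junction temperature (deg C)
    """
    report = {}
    for mode, grid in mode_maps.items():
        rows, cols = len(grid), len(grid[0])
        # Hottest cell: the hotspot, and a candidate thermal-sensor location.
        peak, peak_cell = max(
            (grid[r][c], (r, c)) for r in range(rows) for c in range(cols)
        )
        # Largest temperature step between horizontally or vertically
        # adjacent cells, used here as a simple stand-in for the gradient.
        max_step = 0.0
        for r in range(rows):
            for c in range(cols):
                if c + 1 < cols:
                    max_step = max(max_step, abs(grid[r][c] - grid[r][c + 1]))
                if r + 1 < rows:
                    max_step = max(max_step, abs(grid[r][c] - grid[r + 1][c]))
        report[mode] = {
            "peak_c": peak,
            "peak_cell": peak_cell,
            "violates_tj_max": peak > tj_max_c,
            "max_adjacent_delta_c": max_step,
        }
    return report
```

A cell that turns up as the peak in several modes is a natural place to consider for a sensor, though real placement also has to weigh floorplan and routing constraints that this sketch ignores.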
Interestingly, ArchPro considers thermal issues within the larger context of power management. Whether you use ArchPro's tools or not, it's worthwhile to ponder how hardware and software can conspire to mishandle systemic reactions to local thermal events.
For example, if a thermal diode in a cell phone trips and issues an interrupt, the CPU's local thermal throttling mechanisms may gate down the clock and phase-locked loop. If the thermal interrupt itself goes unserviced, however, the rest of the system never responds, and it continues to see an increase in leakage power and temperature. The resulting vicious cycle of leakage and heat can end in thermal runaway.
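The leakage-and-heat loop can be sketched with a simple fixed-point iteration. The model below is an assumption for illustration only: a single lumped junction-to-ambient thermal resistance, and leakage that doubles for every fixed rise in temperature. The doubling interval, reference leakage, and temperature limit are placeholder values, not data for any process.

```python
def settle_temperature(p_dynamic_w, theta_ja_c_per_w, t_ambient_c=25.0,
                       p_leak_ref_w=0.2, t_ref_c=25.0, doubling_c=20.0,
                       t_limit_c=150.0, tol=1e-6, max_iters=10_000):
    """Iterate the leakage/temperature feedback loop.

    Returns the equilibrium temperature in deg C, or None if the loop
    climbs past t_limit_c (treated here as thermal runaway) or fails
    to settle within max_iters.
    """
    temp = t_ambient_c
    for _ in range(max_iters):
        # Assumed leakage model: doubles every `doubling_c` degrees.
        p_leak = p_leak_ref_w * 2.0 ** ((temp - t_ref_c) / doubling_c)
        # Lumped thermal model: T = Tamb + theta_JA * total power.
        new_temp = t_ambient_c + theta_ja_c_per_w * (p_dynamic_w + p_leak)
        if new_temp > t_limit_c:
            return None  # heat and leakage keep feeding each other
        if abs(new_temp - temp) < tol:
            return new_temp  # loop has settled at an equilibrium
        temp = new_temp
    return None


stable = settle_temperature(p_dynamic_w=1.0, theta_ja_c_per_w=20.0)
runaway = settle_temperature(p_dynamic_w=2.0, theta_ja_c_per_w=40.0)
```

With the placeholder values, the first case settles at a finite temperature while the second never finds an equilibrium below the limit. The point is that the outcome depends on total system power and the thermal path, which is why throttling one block may not be enough if the rest of the system never reacts to the interrupt.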
Traditional logic models often don't account for what each subsystem's logic is doing in the event of thermal problems. "When a thermal interrupt kicks in, many actions are set in motion," says Srikanth Jadcherla, ArchPro's founder and CTO. "Different subsystems are going to standby or shutting down, often in an uncoordinated manner." In traditional simulation, it may look as though the power-control system is responding to the interrupt, but, in fact, that may not be the case at all.
Some advocate an implementation flow that considers power management and thermal concerns concurrently. All of the major RTL-to-GDSII flows on the market take this tack in various ways (see "Power And Thermal Analysis Are Best Done Together").