[Engineering Feature]
We Have Seen The Enemy, And The Enemy Is Heat
Today’s complex SoCs are prone to thermal issues that can cause field failures. Here’s how thermal analysis can help you ferret out those hotspots.
David Maliniak
ED Online ID #14949
March 1, 2007
Copyright © 2006 Penton Media, Inc., All rights reserved. Printing of this document is for personal use only.
As the semiconductor industry traverses
the deep-submicron process nodes,
each plateau along the way carries its own
signature bugaboo arising from physical
effects. At 180 nm, timing-closure issues got
everyone's attention. At 130 nm, signal
integrity was the topic of the day. At 90 and
65 nm, though, power integrity and leakage
are weighing on designers' minds. We now pack so many active
elements onto such a small slab of silicon that power density
has reached near-critical mass. For example, according to Srikanth Jadcherla, founder and CTO of ArchPro Design Automation, a die measuring 1 by 1 cm with power consumption of 1 W dissipates the equivalent of 10 GW per square kilometer, or 25 GW per square mile.
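The unit conversion behind that comparison is easy to check. The short sketch below simply rescales 1 W over 1 cm² to larger areas; the function name is invented for illustration. It gives 10 GW per square kilometer and a little under 26 GW per square mile, so the quoted 25-GW figure is a round number.

```python
# Rescale an areal power density from W/cm^2 to GW/km^2 and GW/mi^2.
CM2_PER_KM2 = 100_000 ** 2      # 1 km = 100,000 cm
KM_PER_MILE = 1.609344          # international mile, exact by definition
KM2_PER_MI2 = KM_PER_MILE ** 2


def power_density_gw(power_w, area_cm2):
    """Return (GW per km^2, GW per mi^2) for a die of the given power and area."""
    w_per_cm2 = power_w / area_cm2
    gw_per_km2 = w_per_cm2 * CM2_PER_KM2 / 1e9
    gw_per_mi2 = gw_per_km2 * KM2_PER_MI2
    return gw_per_km2, gw_per_mi2


per_km2, per_mi2 = power_density_gw(power_w=1.0, area_cm2=1.0)
print(f"{per_km2:.1f} GW/km^2, {per_mi2:.1f} GW/mi^2")
```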
Along with enormous increases in power density comes
the physics of the submicron realm. With narrower feature
sizes come thinner gate insulators, and that translates into
leakage power. Leakage across gates is a condition in which
the gate never shuts entirely off. Rather, it continues to consume power even though it's in a nominally passive state. At
the 65-nm node, leakage can constitute more than 40% of
the overall power consumption of a system-on-a-chip (SoC)
or ASIC (Fig. 1).
Unfortunately, leakage and temperature reinforce each other
in a positive-feedback loop. Leakage begets
heat, which begets more leakage, which begets even more
heat. And, in worst-case scenarios, thermal runaway can
ensue, leading to potential fires and/or explosions in end-user systems.
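A toy model makes the feedback loop concrete. The sketch below is not taken from any vendor's tool. It assumes leakage doubles for every 20°C of temperature rise and that the die reaches ambient through a single lumped thermal resistance. Both assumptions, and every number in it, are hypothetical. With good cooling the loop settles at a stable temperature; with poor cooling it never settles, which is the runaway case.

```python
def settle_temperature(p_dynamic_w, p_leak_ref_w, theta_c_per_w,
                       t_ambient_c=25.0, t_ref_c=25.0, t_double_c=20.0,
                       t_limit_c=150.0, max_iters=1000, tol_c=1e-6):
    """Iterate the leakage/temperature loop.

    Returns (temperature_c, converged). converged is False when the
    temperature passes t_limit_c, which this sketch treats as runaway.
    """
    temp = t_ambient_c
    for _ in range(max_iters):
        # Assumed leakage model: doubles every t_double_c degrees.
        p_leak = p_leak_ref_w * 2.0 ** ((temp - t_ref_c) / t_double_c)
        # Lumped thermal model: rise above ambient = theta * total power.
        new_temp = t_ambient_c + theta_c_per_w * (p_dynamic_w + p_leak)
        if new_temp > t_limit_c:
            return new_temp, False
        if abs(new_temp - temp) < tol_c:
            return new_temp, True
        temp = new_temp
    return temp, False


# Same chip, two hypothetical cooling solutions (10 vs. 30 degC/W).
cooled_t, cooled_ok = settle_temperature(1.5, 0.5, theta_c_per_w=10.0)
hot_t, hot_ok = settle_temperature(1.5, 0.5, theta_c_per_w=30.0)
print(f"10 C/W: settles={cooled_ok}, T={cooled_t:.1f} C")
print(f"30 C/W: settles={hot_ok}, last T={hot_t:.1f} C")
```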
Thus, heat is indeed an enemy that must be faced head-on.
Fortunately, designers can turn to a number of tools and
methodologies for prediction and management of thermal
effects. In this article, we'll explore some of the thermal-analysis
methods that help unearth problem areas. We'll also discuss
some best practices in the thermal-management arena.
LEAKAGE IS A KEY
In addition to its exponential relationship with temperature, leakage is at the root of more subtle, yet no less pernicious, effects. Chief among these are
problems brought on by electromigration, which are exacerbated by the higher current densities.
Then there's the broader issue of thermal variation across
a given die's planar dimensions—even in the Z dimension
between metal layers. Not only do disparities exist in temperature at a great many points on and within the die, but
those variations are far from constant. As major functional
blocks turn on and off, switching activity will have an ongoing effect on the die's thermal characteristics.
THE PERFECT STORM
There is, in fact, an interconnected maze of effects brought about by temperature variation
that involves timing, signal integrity, and reliability (Fig. 2).
As mentioned, temperature has a positive feedback loop with
power and leakage. But it also affects timing by weakening
the driving capability of devices. Higher temperatures mean
an increase in the passive resistance of interconnects, which
in turn increases delays.
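To get a feel for the size of the interconnect effect, the sketch below applies the standard linear temperature-coefficient model for metal resistance. The coefficient is an approximate bulk-copper value and is an assumption here; thin on-chip wires can behave differently, and the 100-Ω wire is invented for illustration. Since the wire's share of an RC delay scales with its resistance, that part of the delay grows by about the same percentage.

```python
ALPHA_CU_PER_C = 0.0039  # assumed: approximate bulk-copper coefficient near 20 C


def wire_resistance(r_ref_ohm, temp_c, t_ref_c=20.0, alpha=ALPHA_CU_PER_C):
    """Linear model: R(T) = R_ref * (1 + alpha * (T - T_ref))."""
    return r_ref_ohm * (1.0 + alpha * (temp_c - t_ref_c))


# Hypothetical wire: 100 ohms at 20 C, compared at 25 C and 105 C.
r_cool = wire_resistance(100.0, 25.0)
r_hot = wire_resistance(100.0, 105.0)
increase_pct = 100.0 * (r_hot / r_cool - 1.0)
print(f"{r_cool:.1f} ohm -> {r_hot:.1f} ohm (+{increase_pct:.1f}%)")
```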
Temperature affects IR drop and electromigration
primarily through Joule heating, or self-heating, of the interconnects. This is another result of the increased
resistance of the wires due to elevated temperatures. The circuit's electromigration lifetime degrades exponentially with
rising temperatures. In IR-drop terms, that increased resistance on the power and ground grids leads to larger IR drops,
meaning more power consumption.
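The exponential dependence of electromigration lifetime on temperature is commonly modeled with Black's equation, in which median time to failure is proportional to J^(-n) · exp(Ea / kT). The sketch below compares lifetimes at two temperatures for the same current density, so the J term cancels. The 0.9-eV activation energy is an assumed, illustrative value; real numbers come from the foundry's reliability data. Under these assumptions, a 20°C rise cuts the modeled lifetime to roughly a quarter.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5


def em_lifetime_ratio(t_hot_c, t_cool_c, activation_ev=0.9):
    """Lifetime at t_hot_c divided by lifetime at t_cool_c (Black's equation).

    Assumes equal current density at both temperatures, so only the
    exp(Ea / kT) term matters. activation_ev is an illustrative value.
    """
    t_hot_k = t_hot_c + 273.15
    t_cool_k = t_cool_c + 273.15
    exponent = (activation_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_hot_k - 1.0 / t_cool_k)
    return math.exp(exponent)


ratio = em_lifetime_ratio(t_hot_c=125.0, t_cool_c=105.0)
print(f"Lifetime at 125 C is {ratio:.0%} of the lifetime at 105 C")
```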
WHY THERMAL ANALYSIS?
Thermal effects, like most
physical effects that plague deep-submicron processes, can be
dealt with through guardbanding. But as savvy designers know, excessive guardbanding, i.e., designing for worst-case process,
voltage, and temperature corners, will leave performance on the
table. It also usually means decoupling the voltage, temperature, process variation, and timing analyses from each other.
But the room for margin continues to shrink anyway.
According to Li-Pen Yuan, group R&D director for extraction and power-integrity tools at Synopsys, there are two
major problems with the margin approach: "Even if we
anticipate elevated temperatures and define a larger range to
which we must design, there's no guarantee that the actual
chip temperatures will be within that new margin. So we run
the risk of violating it."
And pushing too aggressively on
margin is what will cost you
performance. "Thermal analysis should
be used to understand the realistic distribution of temperatures on the chip in
its various modes of operation. That
way, we can potentially reduce the margin and not suffer from excessive
guardbanding," says Yuan.
A typical thermal-analysis flow must
examine three key elements. One
involves the sources of heat and how to
model them. Another is how that heat
is distributed and/or dissipated. The
third is determining when thermal equilibrium is reached.
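Those three elements can be seen in miniature in a single-node thermal model. The sketch below is a deliberately crude stand-in for what the commercial tools do: one heat source, one thermal resistance to ambient as the "distribution network," and one thermal capacitance. It steps forward in time until the temperature stops changing. All component values are hypothetical.

```python
def time_to_equilibrium(power_w, r_th_c_per_w, c_th_j_per_c,
                        t_ambient_c=25.0, dt_s=0.01,
                        settle_c_per_s=1e-3, max_time_s=3600.0):
    """Step a one-node thermal RC model until the heating rate is negligible.

    Returns (temperature_c, elapsed_s). Uses explicit Euler steps, so dt_s
    should be much smaller than r_th_c_per_w * c_th_j_per_c.
    """
    temp = t_ambient_c
    elapsed = 0.0
    while elapsed < max_time_s:
        heat_out_w = (temp - t_ambient_c) / r_th_c_per_w   # distribution
        rate_c_per_s = (power_w - heat_out_w) / c_th_j_per_c
        if abs(rate_c_per_s) < settle_c_per_s:             # equilibrium test
            break
        temp += rate_c_per_s * dt_s
        elapsed += dt_s
    return temp, elapsed


# Hypothetical part: 2 W source, 20 degC/W to ambient, 0.5 J/degC capacity.
final_t, settle_s = time_to_equilibrium(2.0, 20.0, 0.5)
print(f"Settles near {final_t:.1f} C after about {settle_s:.0f} s")
```

The steady-state answer is just ambient plus power times thermal resistance. A real flow needs the spatial detail that this lumped model throws away.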
Thermal models for the chip are created using the source model and the
distribution network. The models
draw upon technology parameters
supplied by the silicon foundry. The
distribution network is modeled using
extraction in thermal tools such as
Apache Design Solutions' Sahara-PTE
thermal-analysis engine.
Those thermal models are supplemented by the boundary conditions describing
their larger environment. According to
Dian Yang, VP of product management
and GM at Apache Design Solutions, the
boundary conditions consist of models
for the chip's package as well as for the
board the package is attached to. Apache
works with Ansys, whose IcePak and IceBoard tools generate those models and
also interface with Cadence's Allegro
Package Designer as well as Sigrity's Unified Package Designer.
Ansys' package-analysis tools are
grid-based, finite-difference solvers that
bring in geometric package shapes from
either the electrical or mechanical CAD
tools. They account for all of the physical
elements, including the die, substrate,
lead frame, die attach, and encapsulant.
The tools automatically build a meshed,
finite-element model that breaks the
package into tiny fragments. This facilitates predictions of temperatures at anywhere from 30,000 to 75,000 distinct
locations within the package.
Once models for the chip, package,
and board are in hand, thermal-analysis
tools such as Apache's Sahara-PTE,
Gradient Design Automation's FireBolt,
or ArchPro Design Automation's
MVSIM can take on the task of identifying thermal hotspots.
Designers can use these tools to
determine whether they're meeting
their maximum junction-temperature
(TJMAX) specifications. Other tasks
performed by thermal-analysis tools
include verification of thermal gradients for various modes of IC operation
and identification of the best locations
for thermal sensors.
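Before any detailed analysis, a first-order junction-temperature check uses the familiar lumped relation TJ = TA + P × θJA. The sketch below applies it with hypothetical numbers. It treats the die as a single uniform temperature, so it says nothing about hotspots or gradients, which is precisely the gap the thermal-analysis tools fill.

```python
def junction_temperature_c(t_ambient_c, power_w, theta_ja_c_per_w):
    """Lumped estimate: TJ = TA + P * theta_JA."""
    return t_ambient_c + power_w * theta_ja_c_per_w


def max_power_w(t_ambient_c, tj_max_c, theta_ja_c_per_w):
    """Largest power that keeps the lumped TJ at or below tj_max_c."""
    return (tj_max_c - t_ambient_c) / theta_ja_c_per_w


# Hypothetical part: 70 C ambient, 2 W, 20 degC/W package, 125 C limit.
tj = junction_temperature_c(70.0, 2.0, 20.0)
margin_c = 125.0 - tj
budget_w = max_power_w(70.0, 125.0, 20.0)
print(f"TJ = {tj:.0f} C, margin = {margin_c:.0f} C, power budget = {budget_w:.2f} W")
```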
Interestingly, ArchPro considers thermal issues within the larger context of
power management. Whether you use
ArchPro's tools or not, it's worthwhile
to ponder how hardware and software
can conspire to mishandle systemic
reactions to local thermal events.
For example, if a thermal diode in a
cell phone trips and issues an interrupt, the CPU's local thermal throttling mechanisms may gate down the clock and
phase-locked loop. However, the thermal
interrupt goes unserviced. Consequently,
the system continues to see an increase in
leakage power and temperature. The
resulting vicious cycle of leakage and
heat can end in thermal runaway.
Traditional logic models often don't
account for what each subsystem's logic
is doing in the event of thermal problems. "When a thermal interrupt kicks
in, many actions are set in motion,"
says Srikanth Jadcherla, ArchPro's
founder and CTO. "Different subsystems are going to standby or shutting
down, often in an uncoordinated manner." In traditional simulation, it may
look as though the power-control system is responding to the interrupt, but,
in fact, that may not be the case at all.
There are those who advocate an
implementation flow that in some way,
shape, or form considers power management and thermal concerns concurrently.
All of the major RTL-to-GDSII flows on
the market address this tack in various
ways (see "Power And Thermal Analysis
Are Best Done Together").
THE VIEW FROM TI
For large systems houses, thermal analysis and management are taken extremely seriously.
In the case of Texas Instruments, it's the
subject of a company-wide initiative.
"We have what's called the Thermal
Council here at TI," explains Darvin
Edwards, manager of advanced package
modeling and characterization and TI
Fellow. "One of the intents of the Council is to educate each of the various business groups within TI as to the nature of
thermal issues they'll face in their products." The Council meets to share lessons
learned from various design projects.
Within TI, thermal analysis is a standard part of the design flow. Design teams run through analyses to determine whether there will be problems.
"We have some rules of thumb," says
Edwards. "For example, we check to
see if there's going to be more than a 2X
differential in temperature gradients
across the die." If there are concerns,
the product engineers are made aware
of the hotspot issues and a power map
may be generated for the die.
In the event of such issues, TI follows
some best practices in efforts to ameliorate them. For one thing, the engineers
will consider reducing the impact of
hotspots by attaching the die directly to
a high thermal-conductivity heat spreader, such as a copper plate. Then, if a die
with hotspots happens to be a thinner
die (say, 50 µm in thickness versus 400
µm), that would imply the need for
chip/package co-design. A special case
concerns packages with stacked die, in
which hotspots on one die within the
package can create hotspots on another.
Engineers at TI try not to cluster
hotspots, if at all possible. Spreading
them apart keeps each hotspot away
from the "thermal footprint" of neighboring hotspots, keeping each of them
cooler. This practice applies to pc-board design as well as to IC design.
If a given die has only one hotspot, the
best place for that hotspot is in the center of the die. Conversely, the worst
place is in a corner. Silicon itself is a
good thermal conductor, so centering a hotspot in the die gives it the best
possible position for heat spreading.
But when multiple hotspots exist, it's
poor practice to cluster them in the center, which effectively creates one large
hotspot. In such cases, it's best to distribute them relatively evenly over the die
while still avoiding the corners and/or
edges. That way, each hotspot has a chance to
dissipate its heat evenly through the
medium of the substrate.
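These placement guidelines can be reproduced with a very small grid model. The sketch below is an illustration only, not TI's method. It divides the die into an 11-by-11 grid, lets each cell conduct heat laterally to its neighbors and vertically to ambient, treats the die edges as insulated, and relaxes the grid to steady state. The conductances and hotspot powers are hypothetical. Under these assumptions, a corner hotspot runs hotter than a centered one, and two adjacent hotspots run hotter than the same two spread apart.

```python
def peak_temperature_c(hotspots, size=11, hotspot_w=0.5,
                       g_lateral=0.05, g_vertical=0.005,
                       t_ambient_c=25.0, tol_c=1e-7, max_sweeps=50000):
    """Peak steady-state temperature of a size x size die model.

    hotspots is a list of (row, col) cells, each dissipating hotspot_w.
    Conductances are in W/degC and are hypothetical illustration values.
    """
    power = [[0.0] * size for _ in range(size)]
    for row, col in hotspots:
        power[row][col] += hotspot_w
    temps = [[t_ambient_c] * size for _ in range(size)]
    for _ in range(max_sweeps):
        max_change = 0.0
        for r in range(size):
            for c in range(size):
                # Insulated edges: cells off the die are simply skipped.
                near = [temps[rr][cc]
                        for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                        if 0 <= rr < size and 0 <= cc < size]
                # Heat balance for one cell, solved for its temperature.
                new_t = ((g_lateral * sum(near) + g_vertical * t_ambient_c + power[r][c])
                         / (g_lateral * len(near) + g_vertical))
                max_change = max(max_change, abs(new_t - temps[r][c]))
                temps[r][c] = new_t
        if max_change < tol_c:
            break
    return max(max(row) for row in temps)


center = peak_temperature_c([(5, 5)])
corner = peak_temperature_c([(0, 0)])
clustered = peak_temperature_c([(5, 5), (5, 6)])
spread = peak_temperature_c([(5, 2), (5, 8)])
print(f"one hotspot:  center {center:.1f} C, corner {corner:.1f} C")
print(f"two hotspots: clustered {clustered:.1f} C, spread {spread:.1f} C")
```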
NEED MORE INFORMATION?
•Ansys
www.ansys.com
•Apache Design Solutions
www.apache-da.com
•ArchPro Design Automation
www.archpro-da.com
•Gradient Design Automation
www.gradient-da.com
•Magma Design Automation
www.magma-da.com
•Synopsys
www.synopsys.com
•Texas Instruments
www.ti.com