Deep-submicron systems-on-a-chip (SoCs) require a power-grid voltage drop of much less than 10% of VDD. Decoupling capacitors, or decaps, help achieve this goal by minimizing switching noise. Determining the amount of decap required for an SoC involves many considerations, but the task needn’t be a chore. The approach described in this article allows you to allocate decap accurately and with minimal area overhead for deep-submicron (DSM) SoCs.

The method detailed here uses uniform placement (rather than structured relative placement) because the uniform approach has a smaller impact on congestion (both routing and placement) and can be part of a standard flow that doesn’t require custom structures. In fact, this method is now part of the design flow within the Synopsys Pilot Design Environment, implemented with Tcl scripts for floorplanning, incremental decap insertion, dynamic rail analysis, and debug.

Decap method overview
A key feature of the method is a phased approach with early estimation and a simple decap insertion implementation. Tests with a number of designs implemented by Synopsys Professional Services have shown that the best results are obtained if decaps are initially placed at the floorplanning stage. This approach enables you to make post-placement corrections easily if dynamic rail analysis indicates a need for additional decap cells.

Note that dynamic rail analysis is essential for designs at 90 nm or smaller. Static analysis provides a picture of voltage-drop problems, but it’s incomplete for DSM designs.

The analysis described in this article shows that you need to allocate between 7% and 10% of design area to decap cells. That may seem excessive, but at 90 nm and below, both dynamic and static voltage drop are significant problems.

Industry examples and rail analysis indicate that deep-submicron SoCs can suffer from performance issues and even functional failures without sufficient decap. Very high-performance designs can therefore require that as much as 20% of the placement area be dedicated to decoupling capacitors. The exact value for decap density varies according to each major block’s dynamic power density, which is proportional to operating frequency f.

For clarity, this article describes the decap method as it applies to an example SoC. The method applies equally well to a range of high-performance and handheld/mobile designs.

Estimating decap
For worst-case planning purposes, assume that decap comes entirely from dedicated Cox (gate-oxide capacitance) sources. Other intrinsic sources, such as inactive gate-junction capacitances and wire capacitance between adjacent VDD and VSS power grid nets, are ignored. A less pessimistic but reasonable assumption would be for 25% (in certain cases, as much as 50%) of the decoupling capacitance to come from intrinsic sources. In our case, we assumed 0% for an absolute worst-case estimate. At the block level, a slight refinement improves estimations: assume at least 25% of the decap comes from non-dedicated Cox sources.

For accurate analysis of decap requirements, you need good estimates of dynamic power consumption. A practical approach for getting these estimates at early design stages is to characterize the design for power based on published results for similar designs. The International Technology Roadmap for Semiconductors (ITRS) is helpful for this purpose (see the table). The table also highlights key technology parameters for copper interconnects.

From the ITRS table, choose a design whose characteristics are closest to your target design. By comparing the ITRS design characteristics with those of the target design, you can calculate a baseline decap requirement. To further refine your model, you can scale values for VDD and frequency to appropriate values.
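The scaling step can be sketched in a few lines. This is a minimal illustration that assumes the usual first-order relation for dynamic power (proportional to VDD squared and to frequency, with switched capacitance and activity held constant). The function name and the reference values are hypothetical and are not taken from the ITRS table.

```python
def scale_dynamic_power(p_ref_w, vdd_ref, f_ref_hz, vdd_target, f_target_hz):
    """Scale a reference dynamic-power figure to a new supply and clock.

    Assumes P_dyn is proportional to VDD^2 * f, with switched capacitance
    and activity factor unchanged between reference and target.
    """
    return p_ref_w * (vdd_target / vdd_ref) ** 2 * (f_target_hz / f_ref_hz)


# Hypothetical reference design: 150 mW at 1.2 V and 400 MHz,
# retargeted to 1.0 V and 350 MHz.
p_target_w = scale_dynamic_power(0.150, 1.2, 400e6, 1.0, 350e6)
print(f"scaled dynamic power: {p_target_w * 1e3:.1f} mW")
```

Leakage does not follow this relation, so a scaled figure of this kind is only a starting point for the baseline decap estimate.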

For the purpose of illustrating recommended decap estimation and insertion techniques, we reference an example design throughout the following discussion. Our example design used the ARM1176 core running at a 350-MHz nominal clock frequency.

Using the method described here, we estimated power consumption of 104 mW at VDD = 1.0 V under typical operating conditions. With an estimated physical area of 4.989 mm², this module clearly possesses a higher power density (and thus higher required decap density) compared to our example design’s video-output sub-module with its 267-MHz maximum clock frequency and area of 11.630 mm².

For the example design’s core, the estimation method indicates a worst-case power density of approximately 4 W/cm². Initial estimates indicate that the core requires a total of 25.680 nF of dedicated Cox decap.
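Power density is simply power divided by area, with a unit conversion from mm² to cm². The sketch below applies it to the 104-mW, 4.989-mm² module quoted above under typical conditions. The helper name is hypothetical, and the calculation does not attempt to reproduce the worst-case figure, which depends on operating conditions the text does not list.

```python
MM2_PER_CM2 = 100.0


def power_density_w_per_cm2(power_w, area_mm2):
    """Return power density in W/cm^2 for a block of the given area."""
    return power_w / (area_mm2 / MM2_PER_CM2)


# Typical-condition figures quoted in the text for the ARM1176-based module.
typical_density = power_density_w_per_cm2(0.104, 4.989)
print(f"typical power density: {typical_density:.2f} W/cm^2")
```

The same helper can be applied to each major block once its power estimate is available, which gives a quick ranking of blocks by required decap density.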

Uniform distribution
In the uniform distribution method, decap cells are placed throughout the design as normal standard cells between row power and ground rails (Fig. 1). The decaps have a statistically uniform distribution and occupy a predetermined percentage of the cell area. Note, however, that the distribution is uniform random: cells are scattered across the total available floorplan area at the target percentage, which yields an irregular placement pattern rather than the regular grid shown in Figure 1.
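To make the idea of a uniform random distribution concrete, here is a small sketch that scatters decap cells over a row-based floorplan until a target area fraction is reached. It is not the algorithm used by any Synopsys tool. The row count, sites per row, and the 32-site cell width are hypothetical, and cells are snapped to cell-width slots so that they cannot overlap.

```python
import random


def preplace_decaps(rows, sites_per_row, cell_width_sites, area_fraction, seed=0):
    """Pick random, non-overlapping (row, start_site) slots for decap cells.

    The number of cells is chosen so that they cover roughly
    `area_fraction` of all placement sites.
    """
    rng = random.Random(seed)
    slots_per_row = sites_per_row // cell_width_sites
    all_slots = [(r, s * cell_width_sites)
                 for r in range(rows)
                 for s in range(slots_per_row)]
    wanted = round(area_fraction * rows * sites_per_row / cell_width_sites)
    # Sampling without replacement guarantees that no slot is used twice.
    return rng.sample(all_slots, min(wanted, len(all_slots)))


# Hypothetical floorplan: 100 rows of 640 sites, 32-site decap, 6% target.
placed = preplace_decaps(rows=100, sites_per_row=640,
                         cell_width_sites=32, area_fraction=0.06)
print(len(placed), "decap cells placed")
```

Because every slot is equally likely, local density fluctuates from region to region, which is why the result looks irregular even though the average matches the target.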

This approach has a two-step process. The first step is to pre-place decaps (before placing the standard cells) using uniform distribution with 6% average recommended area allocation for decap cells. We fix the placement of these cells so that they’re not removed at subsequent design stages.

The second step is to perform post-route incremental decap placement based on the results of intermediate power-density analysis and voltage-drop analysis, budgeting 1% to 4% of the area. This step, which is usually done at the engineering-change-order (ECO) cell and filler insertion stage, may actually remove some decaps in low-activity areas of the design.

With the above as a guide, the worst-case decap area allocation on a global basis would be between 7% and 10% of the core physical area. Bear in mind that because of the dynamic nature of power optimization and analysis, this process is an iterative one; these numbers may require adjustment on the way to design closure.

After early estimation and planning, the flow proceeds to implementation and refinement, which consists mainly of estimating the number of “gate-array” decap cells you need in addition to the floorplanning-based decap cells. These special gate-array ECO cells can be wired as decaps when not required for an ECO. This approach is an alternative to a conventional spare-gate insertion methodology.

The final signoff phase of the decap flow requires detailed dynamic voltage-drop analysis with one or more vector-based activity (VCD) files from gate-level simulations. A dynamic rail-analysis tool such as Synopsys’ PrimeRail should be used for this purpose. The analysis scope is limited to the digital core (top level plus all underlying major blocks).

It’s also advisable to use a static analysis tool such as Synopsys’ AstroRail to verify power-grid integrity and static voltage drop. The PrimePower-to-AstroRail and PrimePower-to-PrimeRail interfaces, which use binary files, allowed the use of instance-based power information. Although a PrimePower-to-PrimeRail link was used for this example, a newer PrimeTime-PX-to-PrimeRail link is now in place.

It’s sometimes difficult, time-consuming, and disk-space intensive to generate the necessary VCD files. As an alternative, you can employ the same statistical switching estimates used to constrain the design through synthesis and dynamic rail analysis. The caveat is that the results are usually pessimistic. Clock-gated designs can introduce additional pessimism, so take this into account when interpreting the results.

Initially, you can verify the design using A/B comparisons, with the goal of showing the relative improvement in worst-case voltage droop. This approach also allows you to determine if the tool is producing the expected results or if you need to make general setup/configuration changes. After reaching that milestone, you can use the results to determine whether you have to modify decap density estimates or obtain additional characterization data.
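An A/B comparison of this kind reduces to comparing the worst-case droop of two runs. The sketch below assumes each run has been exported as a mapping from instance name to peak droop in millivolts. That export format, the instance names, and the values are all hypothetical.

```python
def droop_improvement(run_a_mv, run_b_mv):
    """Compare worst-case droop of baseline run A against run B.

    Returns (worst_a, worst_b, fractional_reduction), where a positive
    reduction means run B is better than run A.
    """
    worst_a = max(run_a_mv.values())
    worst_b = max(run_b_mv.values())
    return worst_a, worst_b, (worst_a - worst_b) / worst_a


# Hypothetical peak-droop values (mV) before and after adding decap.
run_a = {"u_core/buf1": 120.0, "u_core/buf2": 95.0, "u_vid/reg7": 80.0}
run_b = {"u_core/buf1": 90.0, "u_core/buf2": 85.0, "u_vid/reg7": 78.0}

worst_a, worst_b, gain = droop_improvement(run_a, run_b)
print(f"worst-case droop: {worst_a:.0f} mV -> {worst_b:.0f} mV ({gain:.0%} better)")
```

A negative or near-zero gain on a run that added decap is a hint that the tool setup, rather than the design, needs attention.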

Insert decaps at block and top level
Although planning is done at the top level and a uniform distribution is assumed to simplify calculations, it’s best to begin decap insertion at the block level. Each block has a uniform decap distribution, but the decap densities vary from block to block. This non-uniform distribution across the die (when viewed from the top level) satisfies block-level power-density requirements. The same script and algorithm used for decap insertion at the block level can be used at the top level (when the algorithm is configured for use with rectilinear placement regions).
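One simple way to derive per-block densities is to scale a reference decap fraction by each block's dynamic power density and clamp the result. The sketch below is an illustrative policy, not the one used in the flow. The 6% reference and the 20% ceiling echo figures mentioned in this article, while the 3% floor, the block names, and the power densities are hypothetical.

```python
def block_decap_fraction(power_density, ref_density,
                         ref_fraction=0.06, floor=0.03, ceiling=0.20):
    """Scale the reference decap area fraction linearly with power density.

    The result is clamped to [floor, ceiling] so that low-activity blocks
    still receive some decap and hot blocks do not exceed the area budget.
    """
    scaled = ref_fraction * power_density / ref_density
    return min(ceiling, max(floor, scaled))


# Hypothetical block power densities in W/cm^2; 2.0 is taken as the reference.
blocks = {"cpu": 4.0, "video_out": 2.0, "peripherals": 0.5, "dsp": 10.0}
for name, density in blocks.items():
    frac = block_decap_fraction(density, ref_density=2.0)
    print(f"{name:12s} {frac:.1%} of placeable area")
```

Each block keeps a uniform distribution internally; only the target fraction passed to the insertion step changes from block to block.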

Top-level decap insertion is similar to inserting decaps at the block level. Adjust the decap density to reflect the estimated voltage drop requirements at the top level. Decaps required by the coupling of blocks at the top level are also analyzed and fixed with a top-level run.

Implementation details
You can use the native Milkyway axgSpreadGroupCells command to uniformly distribute decaps. Distances between the caps vary after the subsequent placement legalization. Decap insertion is done just after the detailed power-grid information is pushed down from the top-level design.

Invoke the insertion via the following gmake target:

## --------------------------------------------------------------
## Decoupling Capacitance Insertion – target=dcap_insertion
## --------------------------------------------------------------
$(LOG_DIR)/045_dcap_insertion/decoupling_cap.pass:
	@rm -rf $@
	@$(MAKE_CMD) jxttcl \
	    GEV_SRC=040_power_insertion \
	    GEV_DST=045_dcap_insertion \
	    GEV_SCRIPT=$(GEV_GSCRIPT_DIR)/fp/decoupling_cap.tcl \
	    -must_not_have 'Error:' \
	    -must_not_have 'ERROR'

This example specifies the DCAPHVT32 cell as the decap cell master and tells the script to allocate 5% of the placeable area to these decap cells. For clarification, “placeable area” comprises a rectilinear placement area and excludes the area of the macros, blockages, preroutes, and other fixed cells.
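The placeable-area definition translates directly into a cell count. The following sketch subtracts the excluded areas from the core area and converts a 5% allocation into a number of decap cells. All areas and the per-cell area are hypothetical and are not taken from the example design or from any cell library.

```python
def placeable_area(core_um2, macro_um2, blockage_um2, preroute_um2, fixed_cell_um2):
    """Core area minus everything that cannot hold a new standard cell."""
    return core_um2 - macro_um2 - blockage_um2 - preroute_um2 - fixed_cell_um2


def decap_cell_count(placeable_um2, fraction, cell_area_um2):
    """Whole number of decap cells that fit in the requested area fraction."""
    return int(placeable_um2 * fraction // cell_area_um2)


# Hypothetical block: 1 mm^2 core with macros, blockages, preroutes, fixed cells.
area = placeable_area(core_um2=1_000_000.0, macro_um2=300_000.0,
                      blockage_um2=50_000.0, preroute_um2=20_000.0,
                      fixed_cell_um2=30_000.0)
count = decap_cell_count(area, fraction=0.05, cell_area_um2=10.0)
print(f"placeable area: {area:.0f} um^2, decap cells at 5%: {count}")
```

Computing the count from placeable area rather than total core area keeps macro-heavy blocks from being assigned more decap cells than their rows can hold.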

Consider an example generic implementation that requires 5% of the core area to be reserved for decaps. To make the example simple, assume a rectangular floorplan, use only DCAP32 cells, and assume a hypothetical case in which the decaps are placed in regular vertical columns and have a pitch that matches that of the Metal 6 vertical straps. Such a distribution is shown in Figure 1.

For the example design, a total of 134,726 cells were placed uniformly to supply 20.519 nF of dedicated decap. Increasing the allocation to 8% using this same cell would increase the total decap to 32.830 nF. For this implementation example, the total leakage-power contribution of the decap cells ranges from 16.900 to 27.040 mW, which is less than 1% of the 3-W core power budget. Bear in mind that the total leakage of the decap cells varies exponentially with temperature, so some reallocation may be required.
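The figures in this paragraph follow from linear scaling with the area allocation, which the short calculation below reproduces. The last line carries an additional assumption: that the 25.680-nF estimate quoted earlier and the 20.519 nF placed at 5% refer to the same region, which the text does not state explicitly.

```python
# Figures quoted in the text.
CELLS_AT_5PCT = 134_726
DECAP_AT_5PCT_NF = 20.519
LEAK_AT_5PCT_MW = 16.900
CORE_POWER_BUDGET_W = 3.0
ESTIMATED_REQUIREMENT_NF = 25.680


def scale_linear(value_at_5pct, target_pct):
    """Scale a quantity measured at a 5% allocation to another allocation."""
    return value_at_5pct * target_pct / 5.0


per_cell_ff = DECAP_AT_5PCT_NF * 1e6 / CELLS_AT_5PCT      # nF -> fF per cell
decap_at_8pct_nf = scale_linear(DECAP_AT_5PCT_NF, 8.0)
leak_at_8pct_mw = scale_linear(LEAK_AT_5PCT_MW, 8.0)
leak_share = leak_at_8pct_mw * 1e-3 / CORE_POWER_BUDGET_W
# Assumption: same region as the earlier 25.680-nF estimate.
pct_for_estimate = 5.0 * ESTIMATED_REQUIREMENT_NF / DECAP_AT_5PCT_NF

print(f"decap per cell:        {per_cell_ff:.1f} fF")
print(f"decap at 8%:           {decap_at_8pct_nf:.3f} nF")
print(f"leakage at 8%:         {leak_at_8pct_mw:.2f} mW ({leak_share:.2%} of budget)")
print(f"area for 25.680 nF:    {pct_for_estimate:.2f}%")
```

Under that assumption, the allocation needed to meet the early estimate comes out slightly above 6%, which is in line with the pre-placement budget described in this article.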

The decaps can be inserted at the block level using Tcl scripts for Synopsys’ Jupiter-XT floorplanner (scripts are available free of charge from Synopsys Professional Services). You can also obtain an example script that runs in Astro or Jupiter for incremental post-route refinement. Implementation results for part of the example design can be seen with the decap cells highlighted in red (Fig. 2).

Analysis and verification
Though verification of the example design revealed no major issues with the decap flow, many voltage-drop issues were uncovered and fixed. Three of these issues are worth consideration here.

In one case, two large clock buffers in the same clock tree were switching almost simultaneously, and they caused a droop in VDD (Fig. 3). The buffers straddled a horizontal Metal-1 VDD row strap. Adding placement halos around these buffers enforced a minimum spacing between them and ensured that adjacent clock buffers didn’t straddle a rail as they were placed. This change also left room for decaps to be placed automatically at the floorplanning stage. The fix prevented the problem from occurring during initial placement, clock-tree synthesis, and even post-placement optimization. Alternatively, we could have limited the possible orientations of the clock-buffer cells to prevent placing them on adjacent rows.

The second voltage-drop issue occurred in the example design’s largest high-performance block. This block, which had a maximum frequency of 377 MHz, had well over 500,000 placeable instances, including 85 memories, as well as many clock domains. The VDD dynamic voltage-drop analysis showed a hot spot localized to Metal-1 row straps between tightly placed RAMs. The problem resulted from the absence of decap cells between memories. A hard placement blockage in the floorplan prevented decaps from being placed in this area (Fig. 4).

This problem could have been fixed by manually placing decaps in an incremental ECO and placement step. The problem revealed a weakness in the flow, however, so we revised the flow to remove the hard placement blockage just before the decap insertion step and to restore the blockage immediately afterward.

The video-output stage was the largest block in the example design, consisting of almost 600,000 placeable instances, including 62 memories, for a total equivalent gate count of about 2.6 Mgates. This block’s maximum frequency was relatively low and the clock structure considerably simpler than that of the block described previously.

This block exhibited a different set of voltage-drop problems (Fig. 5). Note the very small hot spot to the upper left. At first, the problem seemed odd, since the decap distribution looked reasonable.

Further investigation revealed a problem relating to a set of I/O-isolation buffers placed on the left side of the design between two closely placed memories. The buffers with the worst voltage-drop characteristics were connected to input ports that had relatively high switching activity. In many cases, these high-activity inputs had multiple back-to-back isolation buffers. The buffers were tightly clumped, preventing decap cells from being placed amongst them (Fig. 6). A hard placement blockage to the left of some RAMs made the clumping worse.

A complicating factor was that the placement flow fixed the placement of the isolation buffers prior to decap cell placement. Adding decap cells among the buffers would have corrected the problem, but because the buffers were already fixed in place, no room could be opened up for the decaps.

For this situation, we spread the RAMs apart slightly and removed the hard placement blockage next to the RAMs so that the isolation buffers could be placed less densely. Spreading the memories apart also allowed isolation buffers to be placed between them.

In summary, the proposed decap flow worked well for the example design (and several other designs). At the same time, this flow can’t solve all voltage-drop issues, so accurate and timely analysis is essential.

While analyzing and debugging dynamic voltage-drop issues is both an interactive and visual process, we found areas where Tcl procedures can help. These scripts primarily speed up the generation of voltage-drop maps and give a quick assessment of the quality of the decap distribution.
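A quick check on the quality of the decap distribution can be as simple as binning decap cells into tiles and flagging tiles whose decap area fraction falls below a threshold. The sketch below illustrates the idea in Python rather than Tcl. The coordinates, tile size, cell area, and threshold are hypothetical, and edge tiles are treated as full-size tiles for simplicity.

```python
import math
from collections import Counter


def decap_density_map(decap_xy, die_w, die_h, tile, cell_area):
    """Return {(ix, iy): decap area fraction} for every tile on the die."""
    nx, ny = math.ceil(die_w / tile), math.ceil(die_h / tile)
    counts = Counter((int(x // tile), int(y // tile)) for x, y in decap_xy)
    return {(ix, iy): counts[(ix, iy)] * cell_area / (tile * tile)
            for ix in range(nx) for iy in range(ny)}


def sparse_tiles(density, threshold):
    """List tiles whose decap fraction is below the threshold."""
    return sorted(t for t, d in density.items() if d < threshold)


# Hypothetical 20 x 20 die with 10 x 10 tiles and three decap cells.
decaps = [(5.0, 5.0), (15.0, 5.0), (5.0, 15.0)]
density = decap_density_map(decaps, die_w=20.0, die_h=20.0, tile=10.0, cell_area=5.0)
print("tiles below 5%:", sparse_tiles(density, threshold=0.05))
```

Tiles flagged this way are natural candidates to cross-check against the voltage-drop map, since gaps such as the ones between closely placed memories show up as empty tiles.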

Summary and recommendations
The decap implementation method described here is designed to allocate decap accurately with minimal related area overhead and avoid late problems that jeopardize the tape-out schedule. The method uses several “best practices” to achieve this goal.

For example, initial decap estimates are based on industry experience and guidelines from recent ASICs. A phased approach with early estimation and implementation of decap insertion is also a key feature. The best results are obtained if the decoupling capacitors are initially placed at the floorplanning stage. Remember, too, that dynamic rail analysis is essential for 90-nm or smaller designs.

You must plan for and implement enough decoupling capacitance as early as possible to ensure a working and on-schedule ASIC. The method described in this article can help you meet decap requirements with a reasonable effort.