The verification of today’s bleeding-edge chips requires the best methodology and tools, including the application of high-capacity formal verification technologies throughout the design flow, from architectural exploration to post-silicon debug. We see this last area, post-silicon debug, as an important value delivered by formal technology for design and verification teams who have not employed formal analysis earlier in the process to get the design right the first time. As case studies demonstrate, using formal analysis to find bugs, fix them, and verify the fixes adds tremendous value in the post-silicon lab.
To understand the stakes involved with post-silicon-bugs, consider these widely acknowledged consequences: Finding bugs in model testing is the least expensive option, as the cost of a bug goes up 10 times when it’s detected in component test, 10 times more in system test, and 10 times more if it gets into the field, which is, of course, the worst possible outcome for everyone involved.
Post-silicon debug can be very stressful. What can you do when the chip doesn’t work and time is running out? Missing the deadline can put one’s job—and one’s company—in jeopardy. Management expects these post-silicon-bugs to get isolated quickly and is regularly calling the design and debug team for updates.
Effort is spent trying to reproduce bugs seen in the lab using directed-random simulation and emulation. But these traditional approaches often run out of steam and cannot find the root cause of the bug fast enough. Formal verification has proved to be a lifesaver in these situations, for it uncovers the root causes of bugs and exhaustively validates fixes when other approaches have failed.
BRING ON THE BUGS
When the post-silicon debug team detects a problem in the chip under test, it is initially very abstract and not well understood. The chip hangs up, is non-responsive, drops packets, and/or sends out wrong output, among other functional problems. The first step is to know what is happening inside the chip.
Many chips today have some kind of on-chip trace extraction capability, which consists of means of freezing the chip when certain events are identified; on-chip logic analyzers that allow a selected group of signals to be multiplexed to external pins; the ability to save the value of certain signals some number of cycles before the freeze event; or scan chains to scan out all the flops. Given this, the post-silicon debug team can extract a failure trace, capturing a limited number of signals for a number of cycles before (and possibly after) a problem is detected.
The next step is to isolate and root-cause the post-silicon bug. At this point it is known that the chip is exhibiting illegal behavior, as it can be seen in the trace. But the trace represents the last N cycles of the run, and how this state was reached is not known. Typically, there will be a limited number of signals in the trace, and it is difficult to choose the right set of signals to show the problem.
This scenario is a classic test-lab dilemma (Fig. 1). The last few cycles of the failing scenario can be observed, but how can the root cause of the problem be found? How can the designer know in which block the bug is located?
USING DIRECTED-RANDOM SIMULATION
Traditionally, the directed-random simulation team is called in at this point to isolate the bug. Can it discover how this state was reached using simulation? Where the bug is happening isn’t clear, but it is known that it is causing block D to function incorrectly. The bug happens after a three- to four-hour run in the lab when a certain kind of traffic is injected—for instance, read transactions only on bus X.
In reality, finding the root cause with directed-random simulation will be, in most cases, like “Mission Impossible.” If it took four hours of real time with random traffic to hit the bug, how long will it take to reproduce it when simulation time is 1000 times slower? Four thousand hours of simulation? And can that be done in a week?
FORMAL TO THE RESCUE
Naturally, the right formal verification tool, with advanced capabilities, is required. These technologies include:
- High capacity for end-to-end proofs, making formal analysis practical for very large designs of hundreds of thousands of blocks
- The ability to prove the integrity of data transfers across a design, as offered in Jasper Formal Scoreboard and Proof Accelerators; multiple input and output ports make it possible to verify many different types of data transfers, such as transfers relying on byte enables and bus-width conversion, or serial-to-parallel conversion; Proof Accelerators for many common blocks increase performance and proof convergence
- Visualization for easy debug; the ability to easily refine the scenarios to match the lab
- Asynchronous clock handling
One of the key advantages of using proven formal verification technologies in post-silicon debug is their ability to find bugs fast. Usually, finding counter examples (CEX) with formal analysis is much faster than reaching full proofs on the same property, and this technique lends itself very well to bug hunting. A capable formal verification tool lets the user freeze a specific state in a specific cycle and then continues the analysis from that point, so it is unnecessary to analyze the entire design at once.
Finding the bug with formal analysis follows almost the same process as in a pre-silicon formal verification flow. There are, however, some significant differences:
- Search is for one specific bug and one specific scenario
- The process is not looking for full proof or coverage completeness
- The process just needs to find the scenario that leads to the illegal behavior
- Over-constraints are allowed to simplify the process—for example, disallow write transactions because the bug only happens with read transactions
In a typical application of formal verification, the team would start with what is known. A trace already shows the illegal scenario, and the problem is known to happen when a read transaction is followed by another read transaction. All that needs be done then is define a property stating - not (illegal_scenario).
In the process, it is unnecessary to constrain the design as in a normal formal run, thereby simplifying the process and allowing for additional shortcuts. For example, it is acceptable to over-constrain the design and add the constraint “no_write_transaction_allowed” since it is known the bug happened in “read” and not in “write” transactions.