[Design View / Design Solution]
Formal Methodology Validates Cache-Coherence Protocol
Why wait until RTL is available to root the bugs out of a complex system design? Formal verification methods can be leveraged at the architectural level.
Norris Ip,
Scott Meeth,
Holly Stump
ED Online ID #21476
July 23, 2009
Copyright © 2006 Penton Media, Inc., All rights reserved. Printing of this document is for personal use only.
Reprints
Bugs in RTL code are problematic, but a bug in an architectural
specification can be catastrophic. If the bug
remains undetected until post-silicon debugging, the
design process essentially starts all over again.
Thus, it’s crucial to move the verification process as far forward
as possible. With that motivation in mind, engineers at Sun
Microsystems recently applied formal verification to an application
that commercial tools have not generally addressed—architectural-level protocol verification.
Sun’s engineers modeled a complex cache-coherence protocol,
with parallelism and multi-threading, for a high-performance
processor. They also used a commercial formal property-checking
tool from Jasper Design Automation to verify this protocol.
Initially, Sun engineers used Murphi, an academic tool developed
for protocol verification in the 1990s that includes a modeling
language and a state model checker. Although specifically
designed for architectural verification, Murphi lacked the capacity
to thoroughly verify the protocol. It soon became apparent that
for architectural verification, a commercial tool with much more
powerful underlying formal engines would be needed.
As Murphi ran into capacity limitations, Sun and Jasper engineers
worked together to convert the Murphi model to System-Verilog. They tuned the model for convergence in Jasper Design
Automation’s JasperGold, a process that reduced 129 state transition
rules to 17 rules without any loss of coverage.
Since JasperGold is typically used for RTL verification, it took
some work for Sun and Jasper engineers to build the SystemVerilog
model and tune it for architectural proof convergence. The
experience gained during this activity can be applied to other
architectural verification needs.
Once the necessary tuning was accomplished, designers were
able to verify the protocol, expose deep counter-examples, and
highlight micro-architectural details that needed to be added to the
architectural model. This process reduced the risk of an undetected
architectural bug that could have called for an expensive redesign.
It also made Sun engineers more confident about the protocol and
the soundness of the added micro-architectural details.
TAKING A FORMAL APPROACH
Architectural verification is in the critical path because the
consequences of an undetected architectural bug are so severe.
RTL verification won’t catch such a bug. It will simply show that
the RTL code matches the architectural spec. An architectural
specification bug may well remain undetected until post-silicon
debug, causing time-to-market delays, redesign, and possibly a
silicon respin.
Uncovering architectural problems early can prevent costly
redesigns downstream. For example, it’s not feasible to achieve
sufficient coverage at the design level for cache-coherence protocol
checking. That’s because there’s too much sequential depth.
To verify a cache-coherence protocol, a tool must consider a
range of traces that are both wide (in terms of starting and branching
points) and deep (with long sequences of events). That’s hard
enough with an architectural model and infeasible with a designlevel
model.
Formal verification can provide architects with an executable
specification that can be queried. Thus, architects can build a
model and query it in the form of property checks. For example,
an architect can ask the model to show a trace that has triggered a
worrisome mechanism. Architects can check for conditions such
as livelock or starvation that would be difficult to find with simulation.
If an architect or designer is unsure of something, such
as when it’s valid to check cache coherence, the answer can be
found by running a property check.
Formal architectural verification also makes it possible to tune
flow-control parameters, such as how many credits each agent
should have, and thus optimize design throughput. A technology
provided by Jasper lets architects and designers quickly generate
complex input scenarios from simple queries.
Formal verification also is a good way to find and check corner
cases, which are becoming more of a challenge for architectural
verification. With complex protocols, it becomes extremely difficult
to determine which corner cases are possible and how they
might be manifested.
Protocols such as cache coherence are good candidates for
formal verification because so many combinations of events
could occur. It would be difficult, if not impossible, to check all
possible combinations of inputs and outputs through simulation.
Moreover, cache coherence is an especially important protocol to
verify. A cache-coherence bug could result in a silent data corruption
that would disrupt system functionality.
A formal tool for high-level architectural and protocol applications
needs to be able to deal with challenges such as capacity,
design style, and verification flow. In this example, what made
Sun’s experience successful was the collaboration between its
engineers and Jasper’s to adapt the tool to a new application.
HOW ARCHITECTURAL AND DESIGN MODELS DIFFER
Most commercial verification tools were developed for designlevel
models. However, architectural models and design-level
models are fundamentally different. Design-level models have
much more overhead in terms of timing, area, and power. A specific example of design overhead is
staging flip-flops. Designers will often
add flops to long paths to meet timing.
This adds a state that’s not interesting
from an architectural perspective.
Another example of micro-architectural
overhead is pipelining, which adds artificial
sequential depth with additional
combinations of events, each of which is
indistinguishable at the architectural level
of abstraction.
In addition to having less overhead,
architectural models have much richer
state spaces. Thus, a much higher percentage
of state spaces of interest are
reachable, potentially as high as 50%
for an architectural model and as low as
5% to 10% for an RTL model. The low
percentage for RTL models is due to the
structural connectivity of the RTL, which
limits the number of states that can be
reached. In an architectural model, a rule
or property may involve a large number of
state variables, which calls for a more globalized
analysis compared to an RTL tool tasked with evaluating
the input and output of a logic gate.
At the design level, model checking has a large focus on trimming
out irrelevant logic. The focus is on mitigating the design
overhead and reducing the state space that needs to be explored.
At the architectural level, these techniques tend to be unhelpful
because of the negligible overhead. Instead, for the architectural
verification, model checkers should focus on efficiently exploring
a rich state space.
Continue to page 2
SystemVerilog is not usually thought of as an architectural
language, but it can be used for architectural modeling. The constructs
added to Verilog to facilitate testbenches and verification
are equally useful for architectural modeling. They facilitate a
higher level of abstraction through richer sets of types, userdefined
types, and type checking. SystemVerilog’s support for
inline assertions also helps. (One can write an architectural model
in plain Verilog as well, albeit with some difficulty, and that’s
what Sun was doing with earlier projects.)
A DIFFICULT PROTOCOL TO VERIFY
The basic cache coherence protocol is as follows:
• The source node issues a read request (Fig. 1a).
• The home node serializes (orders) and forwards requests (Fig.
1b).
• Nodes snoop and update their respective cache states, and then
they reply with a snoop response (Fig. 1c).
• When all snoop responses have arrived (Fig. 1c, again), the
source node aggregates them and notifies the home node that
t
can forward the next request to this address if there’s a request.
• At this point, the home node can forward the next request (Fig.
1d). Any data responses are delivered to the source node. Note
the overlap of the data stage of the previous request with the
next request. This is the complexity introduced by “relaxed serialization,”
which is described later in this section.
Writebacks are processed as follows:
• The source node issues a writeback request.
• The home node serializes and forwards writebacks along with
other coherent requests.
• The source node delivers writeback data to the home node.
• The home node writes data to memory and sends an acknowledgement
to the source node.
While it may appear conceptually simple, the protocol is challenging
to verify. One reason is that every node is multi-threaded.
Thus, many things can happen at the same time, making unexpected
behavior likely. For example, cache evictions may occur
when they’re not expected.
Multi-threading results in parallelism, and parallelism runs
counter to coherence, correctness, and tool capacity. With parallelism,
a robust protocol is mandatory to maintain coherence. As
demands increase for higher performance, advanced protocols are
now designed with higher parallelism and more relaxed memory
models, and therefore it’s harder to maintain correctness.
An important feature of this protocol, and one that adds to the
complexity of verification, is “relaxed serialization.” Serialization
is a process in which nodes agree on the order of events. If two
source nodes throw out simultaneous coherent requests for the
home node, the home node must serialize the requests and communicate
the ordering. Serialization takes place on a per-address
basis. It’s not necessary to control the parallelism of requests to
different addresses.
The simplest approach would be a strict serialization scheme
that processes only one forwarded request at a time. All other
requests would be stalled until the current one is completed. It
would be simple if the home node waited for full completion of
a request before forwarding the next one to the same address, but
that would come at a performance cost.
Under relaxed serialization, the next request can proceed when
all of the snoop responses for the previous request have arrived
at the source, even though there may be point-to-point data
responses from the previous request that weren’t yet received at
the source. It’s an efficient way to manage a cache protocol, but
it adds more parallelism, creates complex scenarios, and makes
proof convergence more difficult.
Complications also arise because each line shown in Figures
1a through 1d actually consists of three channels—Request,
Response, and Data. The activity in Figures 1a and 1b occurs
on the Request channel. The activity in Figure 1c takes place on
the Response channel, and the activity in Figure 1d takes place
on the Data channel. Figure 2 shows each channel feeding into a
separate buffer. The channels are separate and operate somewhat
independently, adding more parallelism to the protocol.
MODELING FOR CONVERGENCE
Sun’s engineers kept the protocol model as simple as possible
to ease the proof process. The model handled only one cache line
at a time, which sufficed because any coherent violation can be
described with only one address with arbitrary cache eviction.
One-bit data used by the model reduced the state space because
the protocol is independent of the memory address’ actual data.
The model employed two to four parameterized nodes. Two
nodes represented only a subset of possible behavior, and four
posed convergence problems, so engineers focused on modeling
three nodes. Finally, the model handled coherent requests only.
The coherence protocol uses the MOESI protocol, which
encompasses five possible states: M (modified), O (owned), E
(exclusive), S (shared), and I (invalid). These cache states are
reflected in these properties:
• Coherence: at most one node at a time has its line in M/O/E/WB_buff_vld.
• Memory: if no node has its line in M/O/WB_buff_vld, then a
valid copy in any node’s cache is clean.
• Consistency: pairs of nodes with valid cache states have equal
data values.
• Exclusivity: if a node has its line in E/M, then all other nodes
have lines in I; also, if a line is in E then it is clean.
WB_buff_vld is “writeback buffer valid.” When a line in M or
O is evicted, it goes to this state and stays there until it’s written
back or hit by another incoming request.
The initial coherence model was written in Murphi, which provides
an explicit state-model checker and a modeling language.
(Variants use a symbolic model checker and parallel engines, but
they were not actively used in this project.)
Users specify protocols as finite state machines (FSMs). Protocols
are captured in “rules,” which are conditions under which
state transitions can occur. Murphi provides invariants, in-line
assertions, and predicates.
Murphi found some counter-examples while debugging the
model, and in-line assertions found a bug in the model and an
unconsidered case. However, Murphi has a 32-bit implementation,
and capacity was a problem.
Continue to page 3
Engineers weren’t able to reach proof convergence with three
nodes. They could get convergence with two nodes only without
relaxed serialization. Another solution was clearly needed, driving
the move to SystemVerilog and JasperGold.
As it turns out, Jasper’s Norris Ip was the original implementer
of Murphi while he was at Stanford University from 1991 to
1996. He came up with a methodology to convert Sun’s Murphi
model into SystemVerilog. The main conceptual challenge had to
do with non-determinism.
Murphi introduces non-determinism by allowing for two different
rules with the same or overlapping guard or condition.
It’s able to choose either of the two rules to apply. Though
SystemVerilog doesn’t have the rule structure, it does enable
non-determinism via a module’s primary inputs. The ultimate
solution was to employ a Boolean vector input to select one rule
in SystemVerilog.
Murphi’s guarded rules, translated into if-then-else statements,
were ultimately represented in SystemVerilog “always” blocks.
A sequential block updates current state variables, and a combinational
block updates next state variables (Fig. 3). The guard for
each rule references the current state variables, and the action for
each rule updates the next state variables.
GOING FORMAL
Initial runs with JasperGold exposed deeper counter-examples
than Murphi, and these led to more details that were added to the
model. Engineers found that, for certain kinds of requests, a node
needs to “pretend” to send a snoop response to itself—and to stall
the snoop response when necessary.
Another counter-example showed that the home node needs to
include itself when ordering its memory responses. Yet another
showed that when an atomic read-modify-write request is issued
to a line in E (exclusive), the protocol needs to upgrade the line to
M (modifiable) when the request is ordered.
After tuning, formal verification uncovered even deeper
counter-examples. It exposed a property bug by finding flaws
in coherence checkpoints. JasperGold also showed that relaxed
serialization means that there’s really no time during which both
cache states and data are stable and in sync.
The solution was a kind of data forwarding. If copyback data
is en route to a node, the protocol “peeks ahead” to the incoming
values rather than waiting for the data to hit the cache. Data forwarding
was a modeling decision, not a protocol change.
While formal verification exposed deep counter-examples,
its deep traces compelled Sun engineers to pull in some microarchitectural
details. Evictions are one example. The architectural
spec didn’t cover evictions, so initially they were allowed at any
time, but this resulted in coherence violations. The engineers then
needed to define when dirty evictions are allowed.
For example, suppose node X has an invalidating read outstanding
to a cache line, and a snoop response has yet to arrive.
Another thread in node X wants to evict the entry from node X’s
cache. What happens if the eviction is allowed and a critical snoop
response then arrives? The critical snoop response indicates that
copyback data is on the way with changes that must be merged in.
This is the kind of detail that Sun added to the architectural model.
TUNING FOR CONVERGENCE
Several aspects of Sun’s cache-coherence protocol posed challenges
for proof convergence. Relaxed serialization resulted in
multiple orderings for events in sequences. Multi-threaded nodes
meant that evictions could happen at any time and that upgrades
can overlap with earlier reads to the cache line. In-order copyback
and memory responses had to be added to the model,
expanding the state space.
Applying formal verification to architectural verification
exposed a few idiosyncrasies in the tool version used at the time.
One convergence challenge arose when JasperGold synthesized
the “unique case” construct assuming that it models RTL for synthesis
into hardware.
Because there’s no intent to synthesize Sun’s architectural
model into hardware directly, it doesn’t follow typical RTL guidelines
such as timing requirements. Therefore, the “unique case”
construct that was modeling this protocol has many “cases.”
When a one-hot constraint was placed on a rule vector
because engineers were trying to get a one-hot multiplexer, they
instead ended up with a giant priority encoder with very deep
combinational logic. The workaround was to change the rules
vector to binary coded decimal (BCD). Another issue occurred
when the tool changed some index variables to latches. Sun engineers
were able to fix some of these manually.
Perhaps the main challenge was the sequential depth of the
model. A three-node model had 129 rules, making every state
branch out to 129 levels. The large number of rules was due largely
to Murphi’s practice of handling non-determinism by using
multiple rules to execute on the same architectural state. Because
many of the rules were simply different versions of the same rule,
it was possible to greatly reduce the rule count in SystemVerilog.
For example, many rules could be collapsed, including multiple
rules with identical guards, multiple rules with unique/full case
conditions, and multiple rules needed for non-determinism. Some
snoop responses could be collapsed, because the original model
had a rule for each node that returns a snoop response. Similarly,
data responses could be collapsed, because the original model had
one rule for copyback and another for memory data.
Continue to page 4
In the end, engineers were able to reduce 129 rules to 17 rules,
greatly reducing sequential depth. And JasperGold, with its multiple
formal engines, was able to converge on a proof for a threenode
system with relaxed serialization.
CONCLUSION
Sun’s experience shows that a commercial formal propertychecking
tool can be very adept at architecture-level protocol
checking. By using SystemVerilog and JasperGold, Sun’s engineers
exposed, clarified, and verified micro-architectural issues
that were added to the architectural specification. The main
benefit was increased confidence in the architecture and microarchitecture.
Sun is now leveraging this knowledge as it verifies the RTL
implementation of the coherence protocol. In addition to the
cache-coherence protocol discussed here, Sun already used formal
verification successfully to verify other protocol-related problems.
Architecture-level protocol verification is a new area for commercial
formal verification tools, and based on the experience
described here, it appears promising. Formal tool vendors need
to better optimize their tools to deal with architectural models, an
application currently being pursued by Jasper.
The authors would like to acknowledge Yoganand Chillarige of
Sun Microsystems.
SCOTT MEETH, design/architecture formal verification engineer at
Sun, received a BA in mathematics and a BS in computer science
from the University of Florida, Gainesville, and an MS in mathematics
and an MA in computer science from the University of Illinois at
Champaign-Urbana.
NORRIS IP, director of engineering at Jasper Design Automation,
received a PhD in computer science from Stanford University, Calif.,
and a master’s/bachelor’s degree in computer science from the
University of Oxford, United Kingdom.
HOLLY STUMP, vice president of marketing at Jasper Design,
received a BSEE from the Illinois Institute of Technology,Chicago.
|