Formal Methodology Validates Cache-Coherence Protocol

July 23, 2009
Learn how to use formal verification early in the design cycle to find issues with a cache-coherence protocol.

Bugs in RTL code are problematic, but a bug in an architectural specification can be catastrophic. If the bug remains undetected until post-silicon debugging, the design process essentially starts all over again.

Thus, it's crucial to move the verification process as far forward as possible. With that motivation in mind, engineers at Sun Microsystems recently applied formal verification to an application that commercial tools have not generally addressed—architectural-level protocol verification.

Sun's engineers modeled a complex cache-coherence protocol, with parallelism and multi-threading, for a high-performance processor, and used a commercial formal property-checking tool from Jasper Design Automation to verify it. Initially, Sun engineers used Murphi, an academic tool developed for protocol verification in the 1990s that includes a modeling language and an explicit-state model checker. Although specifically designed for architectural verification, Murphi lacked the capacity to thoroughly verify the protocol. It soon became apparent that architectural verification would require a commercial tool with much more powerful underlying formal engines.

As Murphi ran into capacity limitations, Sun and Jasper engineers worked together to convert the Murphi model to SystemVerilog. They then tuned the model for convergence in Jasper's JasperGold, a process that reduced 129 state-transition rules to 17 without any loss of coverage.

Since JasperGold is typically used for RTL verification, it took some work for Sun and Jasper engineers to build the SystemVerilog model and tune it for architectural proof convergence. The experience gained during this activity can be applied to other architectural verification needs.

Once the necessary tuning was accomplished, designers were able to verify the protocol, expose deep counter-examples, and highlight micro-architectural details that needed to be added to the architectural model. This process reduced the risk of an undetected architectural bug that could have called for an expensive redesign. It also made Sun engineers more confident about the protocol and the soundness of the added micro-architectural details.

TAKING A FORMAL APPROACH
Architectural verification is in the critical path because the consequences of an undetected architectural bug are so severe. RTL verification won't catch such a bug. It will simply show that the RTL code matches the architectural spec. An architectural specification bug may well remain undetected until post-silicon debug, causing time-to-market delays, redesign, and possibly a silicon respin.

Uncovering architectural problems early can prevent costly redesigns downstream. For example, it's not feasible to achieve sufficient coverage at the design level for cache-coherence protocol checking. That's because there's too much sequential depth. To verify a cache-coherence protocol, a tool must consider a range of traces that are both wide (in terms of starting and branching points) and deep (with long sequences of events). That's hard enough with an architectural model and infeasible with a design-level model.

Formal verification can provide architects with an executable specification that can be queried. Thus, architects can build a model and query it in the form of property checks. For example, an architect can ask the model to show a trace that has triggered a worrisome mechanism. Architects can check for conditions such as livelock or starvation that would be difficult to find with simulation. If an architect or designer is unsure of something, such as when it's valid to check cache coherence, the answer can be found by running a property check.
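
As a concrete illustration of such a query, a worrisome scenario can be phrased as a SystemVerilog cover or assert property on the architectural model. The sketch below is a minimal example; the signal names are assumptions for illustration, not taken from Sun's model.

// Illustrative only: signal names such as req_ordered and wb_in_flight
// are hypothetical and not taken from Sun's model.
module arch_queries (
  input logic clk,
  input logic req_ordered,   // home node has ordered a request to the line
  input logic wb_in_flight   // writeback data for the line is still en route
);
  // A cover property asks the tool for a witness trace of the scenario.
  cov_overlap: cover property (@(posedge clk) req_ordered && wb_in_flight);

  // An assert property asks the tool to prove the scenario can never occur.
  no_overlap:  assert property (@(posedge clk) !(req_ordered && wb_in_flight));
endmodule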

Formal architectural verification also makes it possible to tune flow-control parameters, such as how many credits each agent should have, and thus optimize design throughput. A technology provided by Jasper lets architects and designers quickly generate complex input scenarios from simple queries.

Formal verification also is a good way to find and check corner cases, which are becoming more of a challenge for architectural verification. With complex protocols, it becomes extremely difficult to determine which corner cases are possible and how they might be manifested.

Protocols such as cache coherence are good candidates for formal verification because so many combinations of events could occur. It would be difficult, if not impossible, to check all possible combinations of inputs and outputs through simulation. Moreover, cache coherence is an especially important protocol to verify. A cache-coherence bug could result in silent data corruption that disrupts system functionality.

A formal tool for high-level architectural and protocol applications needs to be able to deal with challenges such as capacity, design style, and verification flow. In this example, what made Sun's experience successful was the collaboration between its engineers and Jasper's to adapt the tool to a new application.

HOW ARCHITECTURAL AND DESIGN MODELS DIFFER
Most commercial verification tools were developed for design-level models. However, architectural models and design-level models are fundamentally different. Design-level models have much more overhead in terms of timing, area, and power. A specific example of design overhead is staging flip-flops. Designers will often add flops to long paths to meet timing. This adds state that isn't interesting from an architectural perspective. Another example of micro-architectural overhead is pipelining, which adds artificial sequential depth with additional combinations of events, each of which is indistinguishable at the architectural level of abstraction.

In addition to having less overhead, architectural models have much richer state spaces. Thus, a much higher percentage of state spaces of interest are reachable, potentially as high as 50% for an architectural model and as low as 5% to 10% for an RTL model. The low percentage for RTL models is due to the structural connectivity of the RTL, which limits the number of states that can be reached. In an architectural model, a rule or property may involve a large number of state variables, which calls for a more globalized analysis compared to an RTL tool tasked with evaluating the input and output of a logic gate.

At the design level, model checking has a large focus on trimming out irrelevant logic. The focus is on mitigating the design overhead and reducing the state space that needs to be explored. At the architectural level, these techniques tend to be unhelpful because of the negligible overhead. Instead, for the architectural verification, model checkers should focus on efficiently exploring a rich state space.

SystemVerilog is not usually thought of as an architectural language, but it can be used for architectural modeling. The constructs added to Verilog to facilitate testbenches and verification are equally useful for architectural modeling. They facilitate a higher level of abstraction through richer sets of types, user-defined types, and type checking. SystemVerilog's support for inline assertions also helps. (One can write an architectural model in plain Verilog as well, albeit with some difficulty, and that's what Sun was doing with earlier projects.)

A DIFFICULT PROTOCOL TO VERIFY
The basic cache coherence protocol is as follows:
• The source node issues a read request (Fig. 1a).
• The home node serializes (orders) and forwards requests (Fig. 1b).
• Nodes snoop and update their respective cache states, and then they reply with a snoop response (Fig. 1c).
• When all snoop responses have arrived (Fig. 1c, again), the source node aggregates them and notifies the home node that it can forward the next request to this address, if there is one.
• At this point, the home node can forward the next request (Fig. 1d). Any data responses are delivered to the source node. Note the overlap of the data stage of the previous request with the next request. This is the complexity introduced by “relaxed serialization,” which is described later in this section.

Writebacks are processed as follows:
• The source node issues a writeback request.
• The home node serializes and forwards writebacks along with other coherent requests.
• The source node delivers writeback data to the home node.
• The home node writes data to memory and sends an acknowledgement to the source node.

While it may appear conceptually simple, the protocol is challenging to verify. One reason is that every node is multi-threaded. Thus, many things can happen at the same time, making unexpected behavior likely. For example, cache evictions may occur when they're not expected.

Multi-threading results in parallelism, and parallelism runs counter to coherence, correctness, and tool capacity. With parallelism, a robust protocol is mandatory to maintain coherence. As demands increase for higher performance, advanced protocols are now designed with higher parallelism and more relaxed memory models, and therefore it's harder to maintain correctness.

An important feature of this protocol, and one that adds to the complexity of verification, is “relaxed serialization.” Serialization is a process in which nodes agree on the order of events. If two source nodes issue simultaneous coherent requests to the home node, the home node must serialize the requests and communicate the ordering. Serialization takes place on a per-address basis; it's not necessary to control the parallelism of requests to different addresses.

The simplest approach would be a strict serialization scheme that processes only one forwarded request at a time. All other requests would be stalled until the current one is completed. It would be simple if the home node waited for full completion of a request before forwarding the next one to the same address, but that would come at a performance cost.

Under relaxed serialization, the next request can proceed when all of the snoop responses for the previous request have arrived at the source, even though point-to-point data responses from the previous request may not yet have been received at the source. It's an efficient way to manage a cache protocol, but it adds more parallelism, creates complex scenarios, and makes proof convergence more difficult.

Complications also arise because each line shown in Figures 1a through 1d actually consists of three channels—Request, Response, and Data. The activity in Figures 1a and 1b occurs on the Request channel. The activity in Figure 1c takes place on the Response channel, and the activity in Figure 1d takes place on the Data channel. Figure 2 shows each channel feeding into a separate buffer. The channels are separate and operate somewhat independently, adding more parallelism to the protocol.
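
In a SystemVerilog architectural model, the three channels might be captured with user-defined types along the following lines. This is a sketch under assumed names and field choices, not Sun's actual declarations.

// Illustrative channel and message types; names and fields are assumptions.
package coherence_types;
  typedef enum logic [1:0] {CH_REQUEST, CH_RESPONSE, CH_DATA} channel_t;

  typedef struct packed {
    logic [1:0] src;       // source node id (three-node model)
    logic [1:0] dst;       // destination node id
    logic [2:0] msg_type;  // e.g., read, writeback, snoop response, copyback
    logic       data;      // payload (one bit suffices at this abstraction)
  } message_t;
endpackage

// As in Figure 2, each channel feeds its own bounded buffer per node,
// so the channels can advance somewhat independently of one another.
module channel_buffers import coherence_types::*; #(
  parameter int N = 3, parameter int DEPTH = 2
) ();
  message_t req_buf  [N][DEPTH];
  message_t resp_buf [N][DEPTH];
  message_t data_buf [N][DEPTH];
endmodule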

MODELING FOR CONVERGENCE
Sun's engineers kept the protocol model as simple as possible to ease the proof process. The model handled only one cache line at a time, which sufficed because any coherence violation can be exhibited with a single address and arbitrary cache evictions. The model also used one-bit data, which reduced the state space because the protocol does not depend on the actual data stored at the address.

The model employed two to four parameterized nodes. Two nodes represented only a subset of possible behavior, and four posed convergence problems, so engineers focused on modeling three nodes. Finally, the model handled coherent requests only.
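
A plausible top-level shape for a model with these simplifications is sketched below; the module and signal names are assumptions for illustration only.

// Illustrative top level reflecting the simplifications described above.
module coherence_model #(
  parameter int NUM_NODES = 3   // two nodes under-constrain the behavior,
                                // four hurt convergence
) (
  input logic clk,
  input logic rst
);
  // One cache line (a single address) is modeled: any coherence violation
  // can be exhibited on one address with arbitrary evictions. One-bit data
  // suffices because the protocol does not depend on the data value.
  logic cache_data [NUM_NODES];  // per-node copy of the (one-bit) line data
  logic mem_data;                // the line's value in memory
  // Per-node cache states, and the properties over them, follow in the
  // next section.
endmodule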

The coherence protocol uses the MOESI model, which encompasses five possible cache states: M (modified), O (owned), E (exclusive), S (shared), and I (invalid). These cache states are reflected in these properties:
• Coherence: at most one node at a time has its line in M/O/E/WB_buff_vld.
• Memory: if no node has its line in M/O/WB_buff_vld, then a valid copy in any node's cache is clean.
• Consistency: pairs of nodes with valid cache states have equal data values.
• Exclusivity: if a node has its line in E/M, then all other nodes have lines in I; also, if a line is in E then it is clean.

WB_buff_vld is “writeback buffer valid.” When a line in M or O is evicted, it goes to this state and stays there until it's written back or hit by another incoming request.
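
As a rough illustration, the first and last of these properties might be written as SystemVerilog assertions over per-node state and data arrays. The encoding below is a sketch with assumed names, not Sun's actual checks.

// Illustrative checks for the coherence and exclusivity properties;
// type, state, and signal names are assumptions.
typedef enum logic [2:0] {I, S, E, O, M, WB_BUFF_VLD} cache_state_t;

module coherence_checks #(parameter int N = 3) (
  input logic         clk,
  input cache_state_t st   [N],  // per-node cache state
  input logic         data [N],  // per-node one-bit data
  input logic         mem_data   // memory copy of the (one-bit) data
);
  // Count the nodes holding the line in an "owning" state.
  function automatic int owners();
    int cnt = 0;
    for (int i = 0; i < N; i++)
      if (st[i] inside {M, O, E, WB_BUFF_VLD}) cnt++;
    return cnt;
  endfunction

  // Coherence: at most one node has the line in M/O/E/WB_buff_vld.
  coherence: assert property (@(posedge clk) owners() <= 1);

  // Exclusivity: if any node is in E or M, every other node is in I.
  // (The "a line in E is clean" part is omitted here for brevity.)
  for (genvar i = 0; i < N; i++) begin : g_i
    for (genvar j = 0; j < N; j++) begin : g_j
      if (i != j) begin : g_pair
        exclusivity: assert property (@(posedge clk)
          (st[i] inside {E, M}) |-> (st[j] == I));
      end
    end
  end
endmodule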

The initial coherence model was written in Murphi, which provides a modeling language and an explicit-state model checker. (Variants use a symbolic model checker and parallel engines, but they were not actively used in this project.)

Users specify protocols as finite state machines (FSMs). Protocols are captured in “rules,” which are conditions under which state transitions can occur. Murphi provides invariants, in-line assertions, and predicates.

Murphi found some counter-examples while debugging the model, and in-line assertions found a bug in the model and an unconsidered case. However, Murphi has a 32-bit implementation, and capacity was a problem.

Engineers weren't able to reach proof convergence with three nodes. They could achieve convergence only with two nodes, and only with relaxed serialization disabled. Another solution was clearly needed, driving the move to SystemVerilog and JasperGold.

As it turns out, Jasper's Norris Ip was the original implementer of Murphi while he was at Stanford University from 1991 to 1996. He came up with a methodology to convert Sun's Murphi model into SystemVerilog. The main conceptual challenge had to do with non-determinism.

Murphi introduces non-determinism by allowing for two different rules with the same or overlapping guard or condition. It's able to choose either of the two rules to apply. Though SystemVerilog doesn't have the rule structure, it does enable non-determinism via a module's primary inputs. The ultimate solution was to employ a Boolean vector input to select one rule in SystemVerilog.

Murphi's guarded rules, translated into if-then-else statements, were ultimately represented in SystemVerilog “always” blocks. A sequential block updates current state variables, and a combinational block updates next state variables (Fig. 3). The guard for each rule references the current state variables, and the action for each rule updates the next state variables.
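
The resulting structure might look roughly like the sketch below, with a free rule-select input supplying the non-determinism. The rule bodies and state variables are placeholders, not Sun's translation, and a compact binary-encoded select is shown here for simplicity; the actual encoding choices are discussed under "Tuning for Convergence."

// Illustrative translation structure; rule bodies are placeholders.
module rule_step #(parameter int NUM_RULES = 17) (
  input logic                         clk,
  input logic                         rst,
  // Free primary input: the formal tool chooses which rule (if any)
  // fires each step, supplying Murphi-style non-determinism.
  input logic [$clog2(NUM_RULES)-1:0] rule_sel
);
  logic [2:0] state, state_nxt;   // placeholder architectural state

  // Combinational block: each branch is one guarded rule. The guard
  // reads current state; the action drives next state.
  always_comb begin
    state_nxt = state;                         // default: no rule enabled
    case (rule_sel)
      0: if (state == 3'd0) state_nxt = 3'd1;  // e.g., "issue request"
      1: if (state == 3'd1) state_nxt = 3'd2;  // e.g., "order request"
      // ... remaining rules ...
      default: ;                               // selected rule not enabled
    endcase
  end

  // Sequential block: commit the chosen rule's effect to current state.
  always_ff @(posedge clk or posedge rst) begin
    if (rst) state <= 3'd0;
    else     state <= state_nxt;
  end
endmodule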

GOING FORMAL
Initial runs with JasperGold exposed deeper counter-examples than Murphi, and these led to more details that were added to the model. Engineers found that, for certain kinds of requests, a node needs to “pretend” to send a snoop response to itself—and to stall the snoop response when necessary.

Another counter-example showed that the home node needs to include itself when ordering its memory responses. Yet another showed that when an atomic read-modify-write request is issued to a line in E (exclusive), the protocol needs to upgrade the line to M (modified) when the request is ordered.

After tuning, formal verification uncovered even deeper counter-examples. It exposed a property bug by finding flaws in coherence checkpoints. JasperGold also showed that relaxed serialization means that there's really no time during which both cache states and data are stable and in sync.

The solution was a kind of data forwarding. If copyback data is en route to a node, the protocol “peeks ahead” to the incoming values rather than waiting for the data to hit the cache. Data forwarding was a modeling decision, not a protocol change.
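
A minimal sketch of what such peek-ahead might look like in the model's consistency checking, with hypothetical names:

// Illustrative "peek ahead": when checking consistency, use the value a
// node is about to receive rather than waiting for it to reach the cache.
// All names here are hypothetical.
function automatic logic effective_data(
  input logic cache_data,          // value currently in the node's cache
  input logic copyback_in_flight,  // copyback data is en route to this node
  input logic copyback_data        // the in-flight value
);
  return copyback_in_flight ? copyback_data : cache_data;
endfunction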

While formal verification exposed deep counter-examples, its deep traces compelled Sun engineers to pull in some microarchitectural details. Evictions are one example. The architectural spec didn't cover evictions, so initially they were allowed at any time, but this resulted in coherence violations. The engineers then needed to define when dirty evictions are allowed.

For example, suppose node X has an invalidating read outstanding to a cache line, and a snoop response has yet to arrive. Another thread in node X wants to evict the entry from node X's cache. What happens if the eviction is allowed and a critical snoop response then arrives? The critical snoop response indicates that copyback data is on the way with changes that must be merged in. This is the kind of detail that Sun added to the architectural model.

TUNING FOR CONVERGENCE
Several aspects of Sun's cache-coherence protocol posed challenges for proof convergence. Relaxed serialization resulted in multiple orderings for events in sequences. Multi-threaded nodes meant that evictions could happen at any time and that upgrades could overlap with earlier reads to the same cache line. In-order copyback and memory responses had to be added to the model, expanding the state space.

Applying formal verification to architectural verification exposed a few idiosyncrasies in the tool version used at the time. One convergence challenge arose when JasperGold synthesized the “unique case” construct assuming that it models RTL for synthesis into hardware.

Because there's no intent to synthesize Sun's architectural model into hardware directly, it doesn't follow typical RTL guidelines such as timing requirements. As a result, the “unique case” construct used to model this protocol contains a very large number of case items.

When a one-hot constraint was placed on a rule vector because engineers were trying to get a one-hot multiplexer, they instead ended up with a giant priority encoder with very deep combinational logic. The workaround was to change the rules vector to binary coded decimal (BCD). Another issue occurred when the tool changed some index variables to latches. Sun engineers were able to fix some of these manually.

Perhaps the main challenge was the sequential depth of the model. A three-node model had 129 rules, so every state could branch out up to 129 ways. The large number of rules was due largely to Murphi's practice of handling non-determinism by using multiple rules that operate on the same architectural state. Because many of the rules were simply different versions of the same rule, it was possible to greatly reduce the rule count in SystemVerilog.

For example, many rules could be collapsed, including multiple rules with identical guards, multiple rules with unique/full case conditions, and multiple rules needed for non-determinism. Some snoop responses could be collapsed, because the original model had a rule for each node that returns a snoop response. Similarly, data responses could be collapsed, because the original model had one rule for copyback and another for memory data.
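
For instance, the per-node snoop-response rules might collapse into a single rule that takes the responding node as a free input, as in this hypothetical sketch:

// Illustrative rule collapsing; names are placeholders. One rule with a
// free "responder" input replaces N near-identical per-node Murphi rules,
// and the formal tool explores every choice of responder.
module collapsed_snoop_rule #(parameter int N = 3) (
  input logic                 clk,
  input logic [$clog2(N)-1:0] responder,       // free input: which node responds
  input logic                 fire_snoop_resp  // free input: rule enable
);
  logic snoop_resp_pending [N];
  logic snoop_resp_done    [N];

  always_ff @(posedge clk) begin
    if (fire_snoop_resp && responder < N && snoop_resp_pending[responder]) begin
      snoop_resp_pending[responder] <= 1'b0;
      snoop_resp_done[responder]    <= 1'b1;
    end
  end
endmodule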

In the end, engineers were able to reduce 129 rules to 17 rules, greatly reducing sequential depth. And JasperGold, with its multiple formal engines, was able to converge on a proof for a three-node system with relaxed serialization.

CONCLUSION
Sun's experience shows that a commercial formal property-checking tool can be very adept at architecture-level protocol checking. By using SystemVerilog and JasperGold, Sun's engineers exposed, clarified, and verified micro-architectural issues that were added to the architectural specification. The main benefit was increased confidence in the architecture and micro-architecture.

Sun is now leveraging this knowledge as it verifies the RTL implementation of the coherence protocol. In addition to the cache-coherence protocol discussed here, Sun has already used formal verification successfully on other protocol-related problems.

Architecture-level protocol verification is a new area for commercial formal verification tools, and based on the experience described here, it appears promising. Formal tool vendors need to better optimize their tools to deal with architectural models, an application currently being pursued by Jasper.

The authors would like to acknowledge Yoganand Chillarige of Sun Microsystems.

SCOTT MEETH, design/architecture formal verification engineer at Sun, received a BA in mathematics and a BS in computer science from the University of Florida, Gainesville, and an MS in mathematics and an MA in computer science from the University of Illinois at Champaign-Urbana.

NORRIS IP, director of engineering at Jasper Design Automation, received a PhD in computer science from Stanford University, Calif., and a master's/bachelor's degree in computer science from the University of Oxford, United Kingdom.

HOLLY STUMP, vice president of marketing at Jasper Design, received a BSEE from the Illinois Institute of Technology, Chicago.
