[Design View / Design Solution]
Master On-Chip Embedded Multiprocessor Coherence
Although snoopy virtual-bus approaches are the first step, hybrid snoopy-directory schemes will be the next trend in embedded coherence.
We will mention only the common way of classifying coherence protocols, which is based on the stable states of the caches in the system. The common states are referred to as "MOESI": Modified, Owned, Exclusive clean, Shared clean, and Invalid. The terms are largely self-explanatory, and details are readily available in textbooks.4
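As a rough illustration, the five stable states can be tabulated by what each one implies for the cache that holds the line. The sketch below follows one common textbook reading of MOESI; the names `State`, `PROPERTIES`, `must_write_back`, and `can_write_without_bus_traffic` are hypothetical and not taken from any particular protocol specification.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    OWNED = "O"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


# Per state: (holds valid data, responsible for writing back, only cached copy).
# This table is an assumption based on the usual textbook description.
PROPERTIES = {
    State.MODIFIED: (True, True, True),
    State.OWNED: (True, True, False),
    State.EXCLUSIVE: (True, False, True),
    State.SHARED: (True, False, False),
    State.INVALID: (False, False, False),
}


def must_write_back(state):
    """True if evicting a line in this state requires updating memory."""
    return PROPERTIES[state][1]


def can_write_without_bus_traffic(state):
    """True if the holder is the only cache with a copy, so a write
    needs no coherence transaction."""
    return state in (State.MODIFIED, State.EXCLUSIVE)
```

Under this reading, Owned differs from Modified only in that other caches may still hold Shared copies of the line.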
A related classification is whether the protocol is update- or invalidate-based. In an invalidate-based coherence protocol, a write invalidates all other cached copies, so the system maintains the invariant that a cache line has only a single owner. In an update-based protocol, all copies of the cache line are updated on a write.
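The difference between the two policies can be shown with a toy model of a single cache line. This is a minimal sketch under simplifying assumptions: each cache holds either a value or `None` (no copy), and the function name `write` and its parameters are hypothetical.

```python
def write(caches, writer, value, policy):
    """Return the new per-cache contents of one line after a write.

    caches: list with one entry per cache, holding the cached value
            or None if that cache has no copy of the line.
    writer: index of the cache performing the write.
    policy: "invalidate" or "update".
    """
    if policy not in ("invalidate", "update"):
        raise ValueError("unknown policy: " + repr(policy))

    new = list(caches)
    new[writer] = value
    for i, copy in enumerate(caches):
        if i == writer or copy is None:
            continue  # the writer itself, or a cache with no copy
        # Invalidate: other copies are dropped, leaving a single owner.
        # Update: other copies receive the new value.
        new[i] = None if policy == "invalidate" else value
    return new
```

For example, with caches 0 and 1 both holding the line, a write by cache 0 under `"invalidate"` leaves only cache 0 with a copy, while under `"update"` both caches end up with the new value.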
Serialization
Many older symmetric-multiprocessing (SMP) (non-CMP) systems used a bus to broadcast transactions to all agents in the system. Therefore, the agents could "snoop" their state and then take the proper actions to invalidate and update their copies of the data item. The overlap between the different phases of a transaction was minimal and restricted to in-order slip (pipelining).
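The key property of a broadcast bus is that every agent observes the same transactions in the same order. The sketch below models only that property; the class names `SnoopingAgent` and `Bus`, and the first-come, first-served arbitration, are assumptions made for illustration.

```python
from collections import deque


class SnoopingAgent:
    def __init__(self, name):
        self.name = name
        self.observed = []  # transactions in the order this agent saw them

    def snoop(self, txn):
        # A real agent would check its cache state here and invalidate
        # or update its copy; this sketch only records the order.
        self.observed.append(txn)


class Bus:
    def __init__(self, agents):
        self.agents = agents
        self.pending = deque()

    def request(self, requester, op, addr):
        self.pending.append((requester, op, addr))

    def run(self):
        # Grant the bus to one transaction at a time (FIFO arbitration is
        # an assumption) and broadcast it to every agent before the next.
        while self.pending:
            txn = self.pending.popleft()
            for agent in self.agents:
                agent.snoop(txn)
```

Because only one transaction owns the bus at a time, the `observed` lists of all agents are identical after `run()`, which is the global serial order that coherence relies on.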
But because buses are limited in speed and don't scale in bandwidth, these rigid snoopy schemes evolved into two newer kinds of coherence schemes. At the high end (but still relevant for embedded CMPs, albeit for different reasons), directory-based schemes are common. When the degree of multiprocessing is low, snoopy "virtual bus" schemes are often the preferred route.
Snoopy virtual-bus serialization uses specialized higher-performance interconnects, such as a tree of switches or hierarchical rings (Fig. 2), especially in the request phase of a transaction. In these systems, the interconnect remains responsible for creating the global serial order, even though the limiting physical bus gives way to higher-performance (e.g., serial) point-to-point signaling links.
Directory-based schemes,5 on the other hand, perform the serialization at a new construct called a directory. This directory, which usually resides in the memory module, holds the state of the various cache lines in the system. In general, these systems are far less dependent on the network for serialization and ordering than snoopy schemes (virtual or otherwise). Because messages aren't broadcast in directory schemes, they can scale to much larger systems.
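To see why avoiding broadcast helps, consider a toy directory that records which caches share each line and, on a write, sends invalidations only to those caches. This is a simplified sketch: the `Directory` class and `broadcast_targets` helper are hypothetical, and the model ignores details such as forwarding a read to a cache that holds the line in a modified state.

```python
class Directory:
    def __init__(self, num_caches):
        self.num_caches = num_caches
        self.sharers = {}  # line address -> set of cache ids holding a copy

    def read(self, cache_id, addr):
        """Record that cache_id now holds a copy of the line."""
        self.sharers.setdefault(addr, set()).add(cache_id)

    def write(self, cache_id, addr):
        """Make cache_id the sole holder of the line.

        Returns the caches that must receive a unicast invalidation.
        """
        targets = sorted(self.sharers.get(addr, set()) - {cache_id})
        self.sharers[addr] = {cache_id}
        return targets


def broadcast_targets(num_caches, cache_id):
    """Caches that would see the same write on a broadcast interconnect."""
    return [c for c in range(num_caches) if c != cache_id]
```

In a 16-cache system where only caches 2 and 5 share a line, a write by cache 2 produces one invalidation in the directory model, whereas a broadcast would reach all 15 other caches.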
Another trend affecting on-chip coherence is that next-generation SoCs (with multiple processors) are following a methodology of separating communications from computation, for reasons of complexity mitigation. This has resulted in design methodologies based on networks-on-a-chip (NoCs),6 and the movement from circuit-switched to packet-switched NoCs.7 Any on-chip coherence scheme needs to heed this important move in deep-submicron SoCs and layer the coherence protocol on a packet-switched substrate.
Embedded SoCs face added constraints: cost, low power, real-time operation, intellectual-property (IP) ownership, and possibly heterogeneous processors. Consequently, selecting a coherence scheme for them is a bit different than for their general-purpose counterparts. Low power results in lower system cost, which is a sensitive factor for SoCs. Moreover, if an SoC is used in a mobile application, low power certainly becomes a necessity.
Just as it took a while for caches to break into the DSP world (cycle-accurate processor and system simulators were the key tools that helped accelerate this transition), it will take a while for coherence to do the same. To port software to a real-time system, a coherence/SoC designer must ensure that a sufficiently cycle-approximate (and fast) simulator is available for the application/middleware port. The problem is a bit more severe in high-performance embedded SoCs, since programmers are exposed to the hardware more than in a general-purpose multiprocessor. In the latter, only a restricted set of "system" (middleware, libs, operating system) programmers is exposed to this interface.
IP ownership is a unique feature of embedded SoCs. Most general-purpose CMP vendors' designs don't incorporate any outside IP at the memory-bus level (the level at which coherence is relevant). But outside IP is routine for an embedded-SoC integrator, so much so that even the interconnect (e.g., OCP-IP)8 in many high-performance embedded SoCs is an IP block acquired from an outside IP vendor. Moreover, a high-performance embedded SoC could sometimes benefit from heterogeneous ISA cores sharing the same memory coherently (say, a RISC core and a DSP).
Looking at these trends, the relevance of snoopy virtual-bus coherence schemes to CMPs should be obvious: limited scalability, lots of on-chip bandwidth, point-to-point signaling, less overhead, and low latency. But it's interesting that directory schemes, which are generally considered applicable only to large server-class machines, are also relevant to embedded SoCs (with possible modification). That's because they can work with unordered interconnects, heterogeneous ISAs, lower-power unicast transactions, etc.
While the first generations of embedded CMPs may opt for just a snoopy virtual-bus scheme, more interesting hybrid snoopy-directory schemes are likely to be the next trend in embedded coherence. That's because designers will come to appreciate the modularity benefits of directory-based schemes.