Cache coherency and bus or NoC traffic must be verified by executing real software.
Blessing a system-on-a-chip (SoC) for tape-out is all about being able to sleep at night knowing that the actual silicon will work. Getting to that point involves plenty of sleepless nights trying to build that confidence.
A “simple” single-core SoC, which is typically anything but simple, provides plenty of verification challenge for the simple reason that lots of software must be checked out before you can say the system is good to go. Often that software runs on Linux, so you have to boot Linux to run the application, which means billions of clock cycles of execution just to set up a test.
Emulators play an important role in that scenario as the only way to perform cycle-accurate verification of software. So imagine the situation, then, where you need to validate a multicore system.
Two interacting cores are more than twice the complexity of a single core. While multicore has become commonplace on desktop computers, it’s now an inescapable reality in embedded systems as well. In fact, it’s probably fair to say that embedded multicore is more complex than desktop multicore—in some cases, by a wide margin.
Three Waves Of Multicore
There have been, roughly speaking, three generations of embedded multicore. The first was pioneered for packet processing. Because its practitioners were a lonely and specialized group, there was no real common infrastructure, and they toiled away in relative obscurity, developing their own architectures and methodologies focused with laser-like precision on dispositioning packets as quickly as possible.
The architectures they created were typically pipelines of cores, sometimes simple cores, often running bare metal with run-to-completion programs that were handcrafted to get them to fit. Such designs are moderately simple in terms of their regular structure and no-frills approach, but they can be complex due to the extraordinary measures taken to optimize speed.
The next obvious wave has occurred in the mobile market, most notably in the smartphone. Because of the rapidly increasing expectations placed on phones, the “easy” way to approach it was to dedicate different processors to different functions. You may have one application processor handling high-level stuff, but then there may be video processors or baseband processors or other units that handle various specific tasks.
This kind of heterogeneous environment is complex in that it’s irregular. But because, for the most part, the various cores are off doing their own things independently, the level of interaction between the cores is minimal, and it’s likely to be bursty and relatively well-behaved (to the extent that any smartphone can be thought to behave well).
Neither of these scenarios resembles desktop multicore in the slightest. But the third wave of embedded multicore is now bringing that desktop flavor into the embedded world—not in place of the other architectures, but on top of them. Instead of one application processing core, for example, you may now have a multicore application processor.
This is the symmetric multiprocessing (SMP) world. It’s simple in that the application programmer doesn’t have to worry about the details of the cores as much as is necessary for heterogeneous or pipelined architectures. All the cores look alike and see the same memory. Suitable operating systems (OSs)—say, Linux SMP—take care of the scheduling of which thread runs on what core at any given time, and that’s no small task. Scheduling is a science (and possibly an art) in its own right.
SMP is complex, however, as there is much more inter-core communication. Unlike the heterogeneous case, where you might have the occasional task handoff or results return, SMP typically implies that a single computing problem has been broken apart so that inherent concurrency can be exploited and multiple aspects of the program can execute at the same time.
But programs tend to have lots of dependencies. When you pull the program apart and assign those parts to different cores, you have to make sure that some result being calculated on Core 1 gets communicated to Core 2 so it can proceed with its own work.
This means much more communication. The fact that the OS does the scheduling, typically on a semi-deterministic basis, means you usually can’t be sure which core will run which piece of code when, which cores it will have to talk to, and when they will be ready with their results or need your results.
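To make that concrete, here is a minimal sketch, assuming ordinary POSIX threads on SMP Linux (the variable names and values are purely illustrative, not taken from any particular design): one thread produces a result, and a second thread, which the OS may well place on another core, has to wait for it before proceeding.

    /* Illustrative sketch only: a result computed by one thread is handed to
     * another thread that may be scheduled on a different core. */
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
    static int result = 0;
    static int result_valid = 0;

    static void *producer(void *arg)
    {
        (void)arg;
        int r = 42;                      /* stand-in for real work on core A */
        pthread_mutex_lock(&lock);
        result = r;
        result_valid = 1;
        pthread_cond_signal(&ready);     /* tell the consumer the result exists */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!result_valid)            /* may be running on a different core */
            pthread_cond_wait(&ready, &lock);
        printf("got %d\n", result);
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }

Leave out the condition variable and the outcome depends entirely on which thread the OS happens to run first, and on which core.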
Applications, especially timing-critical ones, therefore have ample opportunity to trip over their own shoelaces if synchronization isn’t performed properly. This makes the verification process that much more critical and difficult.
What Makes Multicore More Difficult?
The issues that make emulation necessary for single-core systems are compounded for multicore systems in two regards.
The first is a simple issue of scale. A core is a complicated piece of logic, and loading a single core into a simulator is a task. Loading multiple cores into a simulator is onerous indeed. Frankly, a large multicore system can even strain the bounds of an emulator. This is where multi-billion-gate emulation becomes important.
Of course, you might simply say that if software on a single-core system really needs to be emulated to check it out thoroughly, then multicore only makes that worse. Nothing more needs to be said. But some subtleties to multicore verification bear consideration, even if they simply add more weight to the need for emulation.
One not-so-subtle subtlety is the challenge of concurrency. While it is possible to simulate concurrency, execution is, in fact, happening sequentially, since no commercial simulators have been successfully parallelized. The event-driven nature of the problem simply doesn’t allow for easy independent operation. Even if you tried to use independent simulation engines, it would be very hard—if even possible—to synchronize them to ensure that they were marching in clock-by-clock lockstep.
Because an emulator is implemented in hardware, it allows explicit concurrent operation by its very nature. Independent hardware islands can execute independently, communicating when needed, and staying in lockstep through the straightforward mechanism of the clock signal.
In addition, verifying multicore software involves considerations that simply don’t exist for sequential execution. Your first task is to ensure that your code doesn’t suffer from any of the archetypical multicore problems: deadlocks (and their ilk) and data corruption due to multiple pieces of the code writing to and reading from the same memory locations in a manner that should be atomic but, in fact, isn’t.
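As a minimal sketch of the second failure mode, assuming C11 atomics and POSIX threads (the counter and iteration count are illustrative): two threads increment a shared counter. The plain increment is really a load, an add, and a store, so concurrent updates can be lost; the atomic version cannot lose them.

    /* Illustrative sketch of the classic atomicity bug. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define ITERS 100000

    static int        plain_count  = 0;   /* racy: load/add/store is not atomic */
    static atomic_int atomic_count = 0;   /* safe: the increment is one atomic op */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < ITERS; i++) {
            plain_count++;                         /* may lose updates under SMP */
            atomic_fetch_add(&atomic_count, 1);    /* never loses updates */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* plain_count often falls short of 200000; atomic_count never does */
        printf("plain=%d atomic=%d\n", plain_count, atomic_count);
        return 0;
    }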
These issues can be exposed or hidden based on timing. Cycle-accurate execution may reveal issues that might not have been evident in other environments. In addition, if the OS is doing the scheduling, then you might need to play with scheduling algorithms or just run multiple schedules to convince yourself that your code is robust in the face of numerous timing scenarios. You need the fastest possible execution to run through as many scenarios as you can.
Even if you know where there may be some vulnerability that you want to check out with specific directed tests, setting those tests up in a simulator can be very difficult. You have to establish an extraordinarily complex state for the test to proceed. It’s much easier, based on the speed of an emulator, simply to run the system through an execution trajectory to arrive at the desired state “organically” rather than trying to set it up manually.
Another issue unique to multicore is the concept of cache coherency. If multiple cores are looking at the same memory but executing different pieces of code, their caches will have different contents. But a given piece of data may be stashed in more than one cache, and, if it changes in one, it needs to change everywhere else. Elaborate schemes and algorithms are used to make this work, and they need to be verified.
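As a rough illustration of what the hardware has to get right, here is a sketch, again assuming C11 atomics and POSIX threads (names are illustrative), of a software pattern that leans directly on coherency: one thread writes a payload and raises a flag, and a thread that may be running on another core spins on the flag before reading the payload. The writer’s update sits in its own core’s cache until the coherency machinery propagates it.

    /* Illustrative sketch: message passing that relies on cache coherency. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int payload = 0;        /* data that may be cached on several cores */
    static atomic_int flag = 0;    /* signals "payload is ready" */

    static void *writer(void *arg)
    {
        (void)arg;
        payload = 123;                                          /* lands in core A's cache */
        atomic_store_explicit(&flag, 1, memory_order_release);  /* publish to other cores */
        return NULL;
    }

    static void *reader(void *arg)
    {
        (void)arg;
        while (!atomic_load_explicit(&flag, memory_order_acquire))
            ;                                                   /* spin until the update arrives */
        printf("payload=%d\n", payload);                        /* must observe 123 */
        return NULL;
    }

    int main(void)
    {
        pthread_t w, r;
        pthread_create(&r, NULL, reader, NULL);
        pthread_create(&w, NULL, writer, NULL);
        pthread_join(w, NULL);
        pthread_join(r, NULL);
        return 0;
    }

Patterns like this are scattered throughout SMP software, so the coherency logic gets exercised constantly once real code is running.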
The only way to do that on a simulator is to stub out as much of the design as possible so you can actually do some testing. That means fake cores and handcrafted test scenarios that will hopefully cover the real-world situations that would occur as the actual SoC hums along. By emulating, you can let the cores themselves manage their caches with real code, and you can then confirm that all is proceeding apace.
A similar situation exists for communication across the chip, and this can be a huge consideration. With a single core, you’ve basically got one master running things. With multicore, you’ve got multiple masters trying to get to the memory or peripherals or send messages to each other. That may happen across a bus, through a network-on-a-chip (NoC), or by using a combination of the two.
The critical question is whether all of the required communication can happen without bogging things down due to congestion. The other side of the question could be asked as well: has the communication infrastructure been over-provisioned, raising the possibility of some cost savings? You would not want to discover any of this at such a late stage in the game, since these are architectural issues. But if the problems exist, you would rather catch them even at a late stage than not at all. Early on, you can only work with statistical models to build confidence in your architecture. It’s only when the design is complete that you can test real traffic on the real design (see the figure).
With simulation, you again have to stub out the actual senders and receivers and model the traffic instead. Emulation is the only way to see the whole system intercommunicating in the manner that it will once cast into silicon.
Fundamentally, this third wave of multicore is adding the kind of execution complexity that makes running real software on the real design a necessity. Any approach that doesn’t include emulation is a recipe for no sleep as you approach tape-out, and then further sleepless nights as you lie awake wondering whether the chip will work. Emulation in the mix will not only get the chip ready sooner, it will also build confidence that the chip will work and set you up for some well-earned peaceful slumber.