Lecture addresses scalability in multicore systems

March 2, 2015

Building scalable multicore systems in which all cores have access to a single shared memory is a concern of MIT Professor Nickolai Zeldovich. He described how to make such computers perform well in the online course “Tackling the Challenges of Big Data,” which runs through March 17.

Ideally, he said, a computer with n cores will do n times the work of a computer with one core, but serial sections of the system (in which, for instance, a core must wait for access to a particular memory location) impose limits that adding more cores cannot lift. Total execution time can be modeled as (p/n) + s, where p is the time the parallel portion would take on a single core and s is the time the serial sections take. As n goes to infinity, this expression approaches s: more cores do nothing to reduce serial execution time.
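
To see what that limit looks like in practice, here is a small, illustrative C program that evaluates (p/n) + s for increasing core counts. The values of p and s are made up for the example, not taken from the lecture.

#include <stdio.h>

/* Illustration of the model from the lecture: total time = p/n + s,
 * where p is the single-core time of the parallel portion and s is the
 * serial time. The numbers below are assumed, for illustration only. */
int main(void)
{
    double p = 100.0;  /* parallel work, seconds on one core (assumed) */
    double s = 5.0;    /* serial work, seconds (assumed) */

    for (int n = 1; n <= 64; n *= 2) {
        double total = p / n + s;
        printf("%2d cores: %6.2f s (speedup %.2fx)\n",
               n, total, (p + s) / total);
    }
    return 0;
}

With p = 100 s and s = 5 s, total time flattens out near 5 s and the speedup saturates near 21x no matter how many cores are added.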

Partitioning can provide an alternative, he said, but it might be difficult to break a system into n independent, self-contained chunks, and you might face load-balancing problems. So, he said, his lecture would set partitioning aside and focus on topics such as how cache coherence affects scalability.

For performance reasons, each core has its own cache, which poses a consistency problem: core 0 might write a value of 5 to its cached copy of a particular location in main memory, while core 1 writes a value of 8 to its own cached copy of that same location.

How, he asked, can we achieve sequential consistency (as if each core directly accessed main memory in turn) in a multicore design with many private caches? The answer, he said, is to use a cache coherence protocol, in which a cache assigns a state—such as shared, modified, or invalid—to affected locations and communicates that information to other caches.
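
As a rough illustration of the idea (not the specific protocol covered in the lecture), the following C sketch models the per-line states of a simple MSI-style protocol and how a cache might update a line's state in response to its own core's accesses and to snooped traffic from other caches:

/* A simplified, illustrative MSI-style state machine for one cache line.
 * Real protocols (MESI, MOESI, directory-based schemes) add states and
 * messages; this only sketches the idea of per-line states that caches
 * communicate to one another. */
typedef enum { INVALID, SHARED, MODIFIED } line_state;

/* Local CPU accesses and snooped events from other caches. */
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } line_event;

line_state next_state(line_state cur, line_event ev)
{
    switch (ev) {
    case LOCAL_READ:   return cur == INVALID ? SHARED : cur;
    case LOCAL_WRITE:  return MODIFIED;                       /* take exclusive ownership */
    case REMOTE_READ:  return cur == MODIFIED ? SHARED : cur; /* write back, then share */
    case REMOTE_WRITE: return INVALID;                        /* another cache now owns it */
    }
    return cur;
}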

He then addressed how to implement a lock on a cache-coherent shared-memory multicore system to uphold higher-level invariants. He described a ticket lock as similar to taking a number when entering a bakery: customers wait for service until their number is called. Unfortunately, the ticket lock can suffer a performance or scalability collapse. After a certain point, performance degrades as you add cores, because handing off the lock from core to core involves overhead proportional to the number of cores trying to acquire it.
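
A minimal ticket-lock sketch in C11 atomics, assuming the usual two-counter design rather than any particular code from the course, looks like this:

#include <stdatomic.h>

/* Illustrative ticket lock: each arriving core takes the next "ticket"
 * and spins until "now_serving" reaches it, like numbers at a bakery. */
typedef struct {
    atomic_uint next_ticket;
    atomic_uint now_serving;
} ticket_lock;

void ticket_lock_acquire(ticket_lock *l)
{
    unsigned my = atomic_fetch_add(&l->next_ticket, 1);  /* take a number */
    while (atomic_load(&l->now_serving) != my)
        ;                                                 /* spin until called */
}

void ticket_lock_release(ticket_lock *l)
{
    atomic_fetch_add(&l->now_serving, 1);                 /* call the next number */
}

The scalability problem is visible in the acquire loop: every waiter spins on the same now_serving counter, so each release invalidates that cache line in every waiting core, and the hand-off traffic grows with the number of waiters.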

An alternative is the Anderson lock, which uses an array, called a go array, so that waiting cores do not contend for the same cache line, avoiding the performance collapse. The downside is that the go array must be allocated in advance, large enough to accommodate the maximum number of cores or threads the system could ever have. The more complex MCS lock avoids this overhead. Yet another approach is the read-write lock, which lets multiple readers hold the lock simultaneously but admits only one writer at a time.
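
The array-based idea can be sketched as follows. This is an illustrative Anderson-style lock; MAX_CPUS, the padding size, and the slot layout are chosen for the example, not taken from the lecture.

#include <stdatomic.h>
#include <stdbool.h>

#define MAX_CPUS   64   /* must cover the largest possible number of waiters */
#define CACHE_LINE 64

/* Each waiter spins on its own slot, padded out to a cache line, so a
 * release touches only the next waiter's line. Initialize with
 * slots[0].go = true, all other slots false, and next_slot = 0. */
typedef struct {
    struct { atomic_bool go; char pad[CACHE_LINE - sizeof(atomic_bool)]; }
        slots[MAX_CPUS];
    atomic_uint next_slot;
} anderson_lock;

unsigned anderson_acquire(anderson_lock *l)
{
    unsigned me = atomic_fetch_add(&l->next_slot, 1) % MAX_CPUS;
    while (!atomic_load(&l->slots[me].go))
        ;                                       /* spin only on my own slot */
    atomic_store(&l->slots[me].go, false);      /* reset the slot for reuse */
    return me;                                  /* caller passes this to release */
}

void anderson_release(anderson_lock *l, unsigned me)
{
    atomic_store(&l->slots[(me + 1) % MAX_CPUS].go, true);  /* wake next waiter */
}

Because each waiter spins only on its own padded slot, a release touches one other core's cache line instead of invalidating a line shared by all waiters; the cost is the statically sized slots array, which is the space overhead the MCS lock avoids.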

He concluded by describing read-copy-update, or RCU, which can avoid locks altogether for read-heavy workloads.
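
The read-mostly pattern can be sketched roughly as below. The config structure and function names are hypothetical, and the grace-period machinery that real RCU implementations use before freeing the old copy is deliberately omitted.

#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

/* Minimal RCU-style sketch: readers dereference a shared pointer with no
 * locks; a writer makes a private copy, modifies it, and atomically
 * publishes the new version. Real RCU also waits for a "grace period"
 * before freeing the old copy so in-flight readers never see freed
 * memory; that machinery is omitted here. current_config must point to a
 * valid config before any reader runs (initialization not shown). */

struct config { int threshold; int verbose; };

static _Atomic(struct config *) current_config;

/* Reader: no lock, just an atomic pointer load. */
int read_threshold(void)
{
    struct config *c = atomic_load(&current_config);
    return c->threshold;
}

/* Writer: copy, update the copy, publish the new version. */
void set_threshold(int value)
{
    struct config *old = atomic_load(&current_config);
    struct config *new = malloc(sizeof *new);
    memcpy(new, old, sizeof *new);        /* copy   */
    new->threshold = value;               /* update */
    atomic_store(&current_config, new);   /* publish */
    /* free(old) only after all readers that might still hold "old" have
       finished (the grace period) -- not shown in this sketch. */
}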


About the Author

Rick Nelson | Contributing Editor

Rick is currently a Contributing Technical Editor. He was Executive Editor for EE from 2011 to 2018. Previously he served on the staffs of several publications, including EDN and Vision Systems Design, and has received awards for signed editorials from the American Society of Business Publication Editors. He began his career as a design engineer at General Electric and Litton Industries and earned a BSEE degree from Penn State.
