Potentially substantial performance gains from the use of multithreading and multiprocessing architectures have captured the attention of designers of consumer devices and other electronic products. Multithreading uses cycles when the
processor would otherwise sit idle to process instructions from
other threads. Multiprocessing, on the other hand, introduces
additional independent processing elements in order to execute threads or applications concurrently. Embedded applications running on multiprocessor and multithreaded architectures, just like those running on conventional processors,
require interrupt service routines (ISRs) to handle interrupts
generated by external events.
One key challenge for designers implementing these new
technologies is avoiding the situation where one thread is
interrupted while modifying a critical data structure, enabling a
different thread to make other changes to the same structure.
Conventional applications overcome this problem by briefly
locking out interrupts while an ISR or system service modifies
crucial data structures.
In a multithreaded or multiprocessing application, this
approach isn't sufficient because of the potential for a switch
to a different thread context (TC), or access by a different processing element that's not impeded by the interrupt lockout. A
more comprehensive approach is required, such as disabling
multithreading or halting other processing elements while the
data structure is being modified.
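As a rough illustration of the difference, the sketch below wraps a modification in both an interrupt lockout and a multithreading lockout. The names irq_lock, irq_restore, mt_lock, mt_restore, and ready_queue are hypothetical stand-ins invented for this example, not a real API. Here they only track state so the code can run on a host machine. On real hardware they would map to the processor's own interrupt-mask and multithreading-control operations, and a multiprocessor design would need a further, analogous step to hold off other processing elements.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for platform primitives. They only record
 * state, so this sketch runs on a host machine. */
static bool irq_enabled = true;
static bool mt_enabled  = true;

static bool irq_lock(void)        { bool was = irq_enabled; irq_enabled = false; return was; }
static void irq_restore(bool was) { irq_enabled = was; }
static bool mt_lock(void)         { bool was = mt_enabled; mt_enabled = false; return was; }
static void mt_restore(bool was)  { mt_enabled = was; }

/* Shared structure that an ISR and several threads may all modify. */
struct ready_queue { int count; };

/* Add one entry while neither an interrupt nor another thread context
 * can run on this processor. */
void queue_add(struct ready_queue *q)
{
    bool irq_was = irq_lock();  /* conventional step: lock out interrupts */
    bool mt_was  = mt_lock();   /* extra step: stop other thread contexts */

    assert(!irq_enabled && !mt_enabled);
    q->count++;                 /* the protected modification */

    mt_restore(mt_was);         /* restore in reverse order */
    irq_restore(irq_was);
}
```

Saving and restoring the previous state, rather than unconditionally re-enabling, lets the routine be called safely from code that has already locked out interrupts.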
IMPROVING PERFORMANCE
Manufacturers of consumer devices and other embedded computing products are
eagerly adding new features, such as Wi-Fi, VoIP, Bluetooth, and
video. Historically, increased feature sets have been accommodated by ramping up the processor's clock speed. In the
embedded space, this approach rapidly loses viability because
most devices are already running up against power consumption and real-estate constraints that limit additional processor
speed increases. Higher clock speeds drive disproportionately
greater power consumption, making high clock speeds unmanageable for more and more embedded applications.
In addition, processors are already so much faster than memory that more than half the cycles in many applications are spent
waiting while the cache line is refilled. Each time there's a cache
miss or another condition that requires off-chip memory access,
the processor needs to load a cache line from memory, write those words into the cache, update the translation lookaside
buffer (TLB), write the old cache line into memory, and resume
the thread. MIPS Technologies stated that a high-end synthesizable core taking 25 cache misses per thousand instructions (a
plausible value for multimedia code) could be stalled more than
50% of the time if it must wait 50 cycles for a cache fill.
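The arithmetic behind that figure can be checked with a few lines of C. This is a back-of-the-envelope sketch, not MIPS's own model: it assumes every instruction takes base_cpi cycles when it hits in the cache and that every miss stalls the pipeline for the full fill latency. The function and parameter names are illustrative.

```c
#include <assert.h>

/* Fraction of all cycles spent stalled on cache fills, per 1000
 * instructions. Assumes no overlap between stalls and execution. */
double stall_fraction(double misses_per_kilo_instr,
                      double fill_cycles,
                      double base_cpi)
{
    assert(base_cpi > 0.0);
    double busy_cycles  = 1000.0 * base_cpi;
    double stall_cycles = misses_per_kilo_instr * fill_cycles;
    return stall_cycles / (busy_cycles + stall_cycles);
}
```

With 25 misses per thousand instructions, a 50-cycle fill, and one cycle per instruction otherwise, stall_fraction(25.0, 50.0, 1.0) gives 1250 stall cycles against 1000 busy cycles, or about 56% of the total.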
MULTITHREADING APPROACH
Multithreading solves
this problem by using the cycles that the processor would otherwise waste while waiting for memory
access. It can then handle multiple concurrent threads of program execution.
When one thread stalls waiting for memory, another thread immediately presents
itself to the processor to keep computing
resources fully occupied.
Notably, conventional processors
can't use this approach because it
takes a large number of cycles to
switch from one thread context to another. Multiple application threads must be
immediately available and "ready-to-run" on a cycle-by-cycle basis for this
approach to work. MIPS accommodates this requirement through its
incorporation of multiple TCs, each of
which can retain the context of a distinct application thread.
In a multithreaded environment such
as the MIPS 34K processor, performance can be substantially improved:
when one thread waits for a memory
access, another thread can use the
processor cycles that would otherwise
be wasted.
A cycle-by-cycle comparison shows how multithreading
can speed up an application. With just
Thread0 running, only five out of 13
processor cycles are used for instruction execution and the rest are spent
waiting for the word to be loaded into
cache from memory. In this case, when
using conventional processing, the efficiency is only 38%. Adding Thread1
makes it possible to use five additional
processor cycles that were previously
wasted. With 10 out of 13 processor
cycles now used, efficiency improves to
77%, providing a 100% speedup over
the base case. Adding Thread2 makes
it possible to fully load the processor,
executing instructions on 13 out of 13
cycles for 100% efficiency. That is 2.6 times the throughput of the base case, or a 160% speedup.
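The efficiency and speedup figures follow directly from the cycle counts, as the short sketch below shows. The helper names are illustrative, and speedup is taken here to mean the percentage increase in instruction cycles executed over the same 13-cycle window.

```c
#include <assert.h>

/* Share of processor cycles that execute an instruction. */
double efficiency(int busy_cycles, int total_cycles)
{
    assert(total_cycles > 0 && busy_cycles >= 0 && busy_cycles <= total_cycles);
    return (double)busy_cycles / total_cycles;
}

/* Percentage increase in useful cycles relative to the base case,
 * measured over the same window of processor cycles. */
double speedup_percent(int busy_base, int busy_new)
{
    assert(busy_base > 0);
    return 100.0 * (busy_new - busy_base) / busy_base;
}
```

For the one-thread case, efficiency(5, 13) is about 0.38. Going from 5 to 10 busy cycles gives speedup_percent(5, 10) = 100, and going from 5 to 13 gives speedup_percent(5, 13) = 160, which is 2.6 times the base throughput.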