[Design Application]
Optimize Memory Subsystem For Top Performance
A Better Understanding Of Memory Accesses Allows DSP Memory Subsystems To Be Better Matched To The DSP Chips.
Contributing Author
ED Online ID #7628
May 25, 1998
Designers are increasingly using multiple DSP chips in applications
that contain huge data sets--tens to hundreds of megabytes. Such applications
can no longer be economically implemented with static RAMs, most of
which typically have maximum capacities of 512 kbytes. Consequently,
many system designers must consider the use of dynamic RAMs (DRAMs)
to provide the larger memory space. Most DRAMs, however, are designed
for PC workstations. To optimize DRAM use in DSP applications, designers
must select the correct DRAM technology based on a different set of
goals.
In addition, most DSP chips are optimized for I/O handling, and that
typically means an interface optimized for use with SRAMs. As a result,
overall memory subsystem performance in a DSP application depends
on both the memory technology and the DSP chip's external interface.
Designers can pick from several DRAM architectures, each of which
brings a number of pros and cons for various DSP system implementations.
Thus, a better understanding of DRAM architectures and the DSP memory
interface will allow designers to better optimize the memory subsystem
for multiprocessor DSP applications.
On PCs, short read bursts for instruction cache-line fills have dominated
accesses to main memory. But the increasing use of object-oriented
languages and multitasking operating systems on PCs has led to a
significant number of accesses that are dispersed throughout main
memory. This, in turn, has led to an increasing emphasis on random-access
latency instead of solely on burst-access time for subsequent reads
to an open DRAM page.
Due to the emphasis on random-access latency, many PC manufacturers
were slow to replace EDO (extended data out) DRAMs with synchronous
DRAM (SDRAM) technology, which emphasizes burst accesses. In a typical
66-MHz memory implementation, SDRAM adds a cycle of latency on the
initial access in exchange for one less cycle on each of the subsequent
accesses. For a four-access burst, the net result is a two-cycle savings,
but that matters only if more than the first fetch is needed.
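The cycle trade-off described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not a model of any particular device: the 5-2-2-2 EDO timing and the function names are assumptions chosen for illustration, and the SDRAM timing is derived from it using only the rule stated in the text (one clock more on the first access, one clock less on each later access).

```python
def burst_cycles(first_access: int, subsequent: int, burst_length: int) -> int:
    """Total clocks needed to fetch `burst_length` words in one burst."""
    if burst_length < 1:
        raise ValueError("burst_length must be at least 1")
    return first_access + subsequent * (burst_length - 1)


def sdram_savings(edo_first: int, edo_next: int, burst_length: int) -> int:
    """Clocks saved by SDRAM relative to EDO for a single burst.

    Applies the trade described in the text: SDRAM pays one extra clock
    on the first access and saves one clock on every later access.
    A negative result means SDRAM is slower for that burst length.
    """
    edo = burst_cycles(edo_first, edo_next, burst_length)
    sdram = burst_cycles(edo_first + 1, edo_next - 1, burst_length)
    return edo - sdram


# Assumed EDO timing of 5-2-2-2 clocks (illustrative, not from a datasheet).
EDO_FIRST, EDO_NEXT = 5, 2

for n in (1, 2, 4, 8):
    saved = sdram_savings(EDO_FIRST, EDO_NEXT, n)
    print(f"burst of {n}: SDRAM saves {saved} clock(s)")
```

Under these assumptions the savings work out to the burst length minus two, so a single isolated fetch loses a clock, a two-access burst breaks even, and longer bursts come out ahead.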
Differing Emphasis
In a DSP system, the speed of instruction loads is generally not
the main concern. Signals are typically processed as vectors, which
are many times the length of the data cache line. The code for the
tight inner loops of signal processing is typically loaded once for
a long vector of data. The emphasis, therefore, is on the speed of
both subsequent reads to the same cache line and immediate
access to sequential memory locations.
The workhorse dynamic memories like standard fast-page mode (FPM)
DRAMs, EDO DRAMs, and burst-EDO DRAMs are basically the same, save
for some differences in the interface for reading data out at the
time of the column access strobe (CAS) signal. With FPM DRAMs, the
CAS signal causes data to be read directly from the sense amplifiers.
EDO DRAMs add a latch to the output of those sense amplifiers, which
allows the data-output buffers to stay on even after the rising edge
of CAS. The result is a faster cycle time from column address to column
address--up to a third faster than standard FPM DRAMs.
Burst-EDO DRAMs replace the output latch on the EDO DRAM with a register.
That adds an internal pipeline stage, which allows data within a burst
to come out quicker after the CAS signal for the second and subsequent
accesses in the burst. The trade-off is an extra pipeline stage for
the CAS signal on the first access, but this does not lower performance
because the first data access is limited by the row access strobe
(RAS) time, not the CAS time.
SDRAMs present more of an architectural change from FPM DRAMs than
do the EDO DRAM variations. From the DSP system designer's standpoint,
the important differences are that SDRAMs are synchronous and use
a clock input. Internally, an SDRAM divides the memory into multiple
banks, each with its own row decoder and sense amps. Current high-performance
SDRAMs use four internal memory banks, although earlier versions typically
used two banks (Fig. 1).
The multibank architecture eliminates gaps between data accesses
because data can be accessed from one bank while the others are precharging.
The SDRAMs buffer both inputs and outputs, which adds to the
latency of the first access in a burst. The increased pipelining,
though, enables both quicker access to a full burst and operation
at higher frequencies, compared to EDO DRAMs.
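The way multiple banks hide precharge time can be illustrated with a simple timeline model. This is a rough sketch under stated assumptions: accesses rotate round-robin across banks, each access occupies the data bus for `transfer` clocks, and a bank needs `precharge` clocks afterward before it can be used again. Row-activation latency and refresh are ignored, and the function name and clock counts are hypothetical.

```python
def total_clocks(num_accesses: int, num_banks: int,
                 transfer: int, precharge: int) -> int:
    """Clocks to complete `num_accesses` row accesses spread round-robin
    over `num_banks` banks that share one data bus."""
    bank_ready = [0] * num_banks  # clock at which each bank is usable again
    bus_free = 0                  # clock at which the data bus is free
    for i in range(num_accesses):
        bank = i % num_banks
        # An access starts once both the bus and the target bank are free.
        start = max(bus_free, bank_ready[bank])
        bus_free = start + transfer
        # The bank precharges in the background after its transfer ends.
        bank_ready[bank] = bus_free + precharge
    return bus_free


# Assumed timing: 4 clocks of data per access, 3 clocks of precharge.
for banks in (1, 2, 4):
    print(f"{banks} bank(s): {total_clocks(8, banks, 4, 3)} clocks")
```

In this model a single bank leaves a precharge gap after every access, while two or more banks keep the bus continuously busy because the precharge time (3 clocks) is shorter than one transfer on another bank (4 clocks).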
As a result, one of the key performance issues becomes how the system
can deal with pipelined memory operations. The highest memory-to-processor
throughput is achieved by using the multiple accesses inherent in
the bursts of a cache line load. If that approach isn't used, the
access rate is limited by the speed of the address bus, which usually
has only a fraction of the data bus's duty cycle. To reach the
full potential of pipelined memory systems, the pipeline should be
full as long as possible. Like a pump that needs priming, a pipelined
memory system incurs startup latency each time the pipeline stalls.
Accessing the long vectors typically used in signal-processing data
arrays helps keep the pipeline full.
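The effect of vector length on pipeline efficiency can be estimated with a one-line formula. The sketch below assumes an idealized memory pipeline that delivers one word per clock once it is full and pays a fixed startup latency once per vector. The six-clock latency and the function name are illustrative assumptions, not figures from any device.

```python
def bus_utilization(vector_length: int, startup_latency: int) -> float:
    """Fraction of clocks that carry data when a vector of
    `vector_length` words is read after one pipeline startup."""
    if vector_length < 1 or startup_latency < 0:
        raise ValueError("need vector_length >= 1 and startup_latency >= 0")
    return vector_length / (vector_length + startup_latency)


STARTUP = 6  # assumed startup latency in clocks (illustrative)

for length in (4, 64, 1024):
    pct = 100 * bus_utilization(length, STARTUP)
    print(f"vector of {length:4d} words: {pct:.1f}% of clocks carry data")
```

With a fixed startup cost, utilization approaches 100% as vectors get longer, which is why the long data arrays common in signal processing suit pipelined memories better than short cache-line fills.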
Match Latency To Pipeline
When evaluating the various memory technologies for use in DSP systems,
the designer should match each technology to the processor's capabilities.
That is, the latency of the memory subsystem should be matched to
the pipeline capabilities of the processor. The more pipelining in
the processor, the higher the latency it can tolerate in the memory
and the memory controller without affecting throughput.
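One way to reason about this matching is a simple bound on sustained throughput. The sketch below is an assumption-laden simplification: it stands in for "pipelining in the processor" with the number of memory requests the processor can keep in flight, assumes the memory can accept one new request per clock, and caps the rate at one word per clock. The names and numbers are hypothetical.

```python
def sustained_rate(outstanding_requests: int, memory_latency: int) -> float:
    """Upper bound on words delivered per clock.

    If each request takes `memory_latency` clocks to return and at most
    `outstanding_requests` can be in flight, the processor receives at
    most outstanding_requests / memory_latency words per clock, capped
    at one word per clock.
    """
    if outstanding_requests < 1 or memory_latency < 1:
        raise ValueError("both arguments must be at least 1")
    return min(1.0, outstanding_requests / memory_latency)


# Assumed memory latency of 4 clocks (illustrative).
for depth in (1, 2, 4, 8):
    print(f"{depth} request(s) in flight: "
          f"{sustained_rate(depth, 4):.2f} words/clock")
```

In this model, throughput is unaffected by latency as long as the processor can keep at least as many requests in flight as the memory latency in clocks. Below that point, every added clock of latency reduces the sustained rate.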