Optimize Memory Subsystem For Top Performance

A Better Understanding Of Memory Accesses Allows DSP Memory Subsystems To Be Better Matched To The DSP Chips.


Contributing Author

May 25, 1998


Designers are increasingly using multiple DSP chips in applications that contain huge data sets--tens to hundreds of megabytes. Such applications can no longer be economically implemented with static RAMs, most of which typically have maximum capacities of 512 kbytes. Consequently, many system designers must consider the use of dynamic RAMs (DRAMs) to provide the larger memory space. Most DRAMs, however, are designed for PC workstations. To optimize DRAM use in DSP applications, designers must select the correct DRAM technology based on a different set of goals.

In addition, most DSP chips are optimized for I/O handling, and that typically means an interface optimized for use with SRAMs. As a result, overall memory subsystem performance in a DSP application depends on both the memory technology and the DSP chip's external interface.

Designers can pick from several DRAM architectures, each of which brings a number of pros and cons for various DSP system implementations. Thus, a better understanding of DRAM architectures and the DSP memory interface will allow designers to better optimize the memory subsystem for multiprocessor DSP applications.

On PCs, short read bursts for instruction cache-line fills have dominated accesses to main memory. But the increasing use of object-oriented languages and multitasking operating systems on PCs has led to a significant number of accesses that are dispersed throughout main memory. This, in turn, has led to an increasing emphasis on random-access latency instead of solely on burst-access time for subsequent reads to an open DRAM page.

Due to the emphasis on random-access latency, many PC manufacturers were slow to replace EDO (extended data out) DRAMs with synchronous DRAM (SDRAM) technology, which emphasizes burst accesses. In a typical 66-MHz memory implementation, SDRAM adds a cycle of latency on the initial access in exchange for one fewer cycle on each of the subsequent accesses. For a four-clock burst, the net result is a two-cycle savings (one cycle added on the first access, three saved on the accesses that follow), but that is only relevant if more than just the first fetch is needed.
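
The burst arithmetic above can be checked with a short sketch. The function names and the baseline EDO timing used in the example call (five cycles for the first access, two for each subsequent access) are illustrative assumptions, not figures from any datasheet. Only the "one cycle added, one cycle saved per subsequent access" relationship comes from the text.

```python
def burst_cycles(first, subsequent, burst_len):
    """Total cycles for one burst: first access plus the remaining accesses."""
    return first + subsequent * (burst_len - 1)


def sdram_savings(edo_first, edo_subsequent, burst_len):
    """Cycles saved by SDRAM relative to EDO for one burst.

    Models SDRAM as EDO with one extra cycle on the first access and
    one fewer cycle on each subsequent access. A negative result means
    SDRAM is slower for that burst length.
    """
    edo = burst_cycles(edo_first, edo_subsequent, burst_len)
    sdram = burst_cycles(edo_first + 1, edo_subsequent - 1, burst_len)
    return edo - sdram


# Hypothetical EDO baseline of 5-2-2-2 cycles (an assumption for illustration).
print(sdram_savings(5, 2, burst_len=4))  # four-access burst
print(sdram_savings(5, 2, burst_len=1))  # isolated random access
```

Under this model, the savings depend only on burst length, not on the baseline timing: a burst of one access loses a cycle, a burst of two breaks even, and longer bursts come out ahead.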

Differing Emphasis
In a DSP system, the speed of instruction loads is generally not the main concern. Signals are typically processed as vectors, which are many times the length of the data cache line. The code for the tight inner loops of signal processing is typically loaded once for a long vector of data. The emphasis, therefore, is on the speed of both subsequent reads to the same cache line and immediate access to sequential memory locations.

The workhorse dynamic memories like standard fast-page mode (FPM) DRAMs, EDO DRAMs, and burst-EDO DRAMs are basically the same, save for some differences in the interface for reading data out at the time of the column access strobe (CAS) signal. With FPM DRAMs, the CAS signal causes data to be read directly from the sense amplifiers. EDO DRAMs add a latch to the output of those sense amplifiers, which allows the data-output buffers to stay on even after the rising edge of CAS. The result is a faster cycle time from column address to column address--up to a third faster than standard FPM DRAMs.

Burst-EDO DRAMs replace the output latch on the EDO DRAM with a register. That adds an internal pipeline stage, which allows data within a burst to come out quicker after the CAS signal for the second and subsequent accesses in the burst. The trade-off is an extra pipeline stage for the CAS signal on the first access, but this does not lower performance because the first data access is limited by the row access strobe (RAS) time, not the CAS time.

SDRAMs present more of an architectural change from FPM DRAMs than do the EDO DRAM variations. From the DSP system designer's standpoint, the important differences are that SDRAMs are synchronous and use a clock input. Internally, an SDRAM divides the memory into multiple banks, each with its own row decoder and sense amps. Current high-performance SDRAMs use four internal memory banks, although earlier versions typically used two banks (Fig. 1).

The multibank architecture eliminates gaps between data accesses because data can be accessed from one bank while the others are precharging. The SDRAMs buffer both inputs and outputs, and that does affect the latency for the first access in a burst. The increased pipelining, though, enables both quicker access to a full burst and operation at higher frequencies, compared to EDO DRAMs.
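
A rough way to see why multiple banks remove the gaps is to count cycles for a stream of row bursts. The sketch below is a deliberately simplified model, not a description of any real SDRAM controller: it assumes rows are visited round-robin across the banks, that a bank's precharge can fully overlap with data transfers from the other banks, and that the cycle counts in the example call are arbitrary.

```python
def stream_cycles(rows, burst_len, precharge, banks):
    """Cycles to stream `rows` row bursts of `burst_len` data cycles each.

    Simplified model (an assumption): each row switch needs `precharge`
    cycles, and the other (banks - 1) banks can each supply one burst
    while that precharge is in progress. Only the part of the precharge
    that is not hidden shows up as a gap on the data bus.
    """
    hidden = (banks - 1) * burst_len
    exposed_gap = max(0, precharge - hidden)
    return rows * burst_len + (rows - 1) * exposed_gap


# Arbitrary illustrative numbers: 8 rows, 4-cycle bursts, 3-cycle precharge.
print(stream_cycles(rows=8, burst_len=4, precharge=3, banks=1))
print(stream_cycles(rows=8, burst_len=4, precharge=3, banks=2))
```

With a single bank, every row switch exposes the full precharge time. With two banks, the precharge in this example is shorter than the other bank's burst, so the data bus stays busy.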

As a result, one of the key performance issues becomes how the system can deal with pipelined memory operations. The highest memory-to-processor throughput is achieved by using the multiple accesses inherent in the bursts of a cache line load. If that approach isn't used, the access rate is limited by the speed of the address bus, which usually has a duty cycle that is only a fraction of the data bus's. To reach the full potential of pipelined memory systems, the pipeline should be kept full as long as possible. Like a pump that needs priming, a pipelined memory system incurs startup latency each time the pipeline stalls. Accessing the long vectors typically used in signal-processing data arrays helps keep the pipeline full.
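
The pump-priming effect can be put in numbers with a small sketch. It assumes one data word per cycle once the pipeline is full and a fixed startup penalty paid once per uninterrupted transfer. The function name and the four-cycle startup figure are hypothetical.

```python
def bus_utilization(vector_len, startup_cycles):
    """Fraction of cycles that carry data for one uninterrupted transfer.

    Assumes one word per cycle after `startup_cycles` of pipeline fill.
    """
    return vector_len / (vector_len + startup_cycles)


# Hypothetical four-cycle startup latency.
for length in (1, 4, 64, 1024):
    print(length, round(bus_utilization(length, 4), 3))
```

A four-word cache-line fill spends half its time priming the pipeline in this model, while a 1024-word signal vector spends well under 1% of its time doing so.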

Match Latency To Pipeline
When evaluating the various memory technologies for use in DSP systems, the designer should match each technology to the processor's capabilities. That is, the latency of the memory subsystem should be matched to the pipeline capabilities of the processor. The more pipelining in the processor, the higher the latency it can tolerate in the memory and the memory controller without affecting throughput.
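
One way to reason about this matching is to compare the number of memory accesses the processor can keep in flight with the round-trip latency of the memory subsystem. The sketch below is an idealized model rather than a description of any particular DSP: it assumes the memory can accept one new access per cycle and that the processor stalls only when all of its outstanding-access slots are in use.

```python
def sustained_throughput(outstanding, latency_cycles):
    """Accesses completed per cycle, capped at one per cycle.

    `outstanding` is the number of accesses the processor pipeline can
    have in flight at once (an assumed parameter). With a latency of
    `latency_cycles`, at most outstanding / latency_cycles accesses
    can complete per cycle on average.
    """
    return min(1.0, outstanding / latency_cycles)


# Same 8-cycle memory latency, different amounts of processor pipelining.
for slots in (1, 4, 8, 16):
    print(slots, sustained_throughput(slots, latency_cycles=8))
```

In this model, throughput is unaffected as long as the processor can keep at least as many accesses in flight as the memory latency in cycles. Below that point, throughput falls in proportion.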
