Embedded RAM In FPGAs Enables FIFO Applications
The asynchronous boundary between system clocks presents one of the most challenging issues in digital design. One solution is to use an off-the-shelf FIFO memory. But even though FIFOs are a relatively direct answer to a difficult problem, they can be challenging to implement.
For example, a random pixel pattern in the leftmost column of a cathode-ray-tube (CRT) display is probably the result of a FIFO protocol bug. The cause is that some FIFOs deliver data one cycle after the NOT_EMPTY flag, which signals that data is available, is asserted. This protocol may suit a semiconductor vendor's data-sheet specs or be convenient for an IP-core vendor's HDL code, but it can cause grief for a designer's state-machine interface to the FIFO.
Engineers can take control of such issues by creating their own FIFO design, using embedded dual-port RAMs in FPGAs. The dual-port feature is crucial for isolating the write side of the RAM from the read side, both architecturally and in time. Dual-port, as the name implies, means that the RAM has separate address and data ports for write and read operations. Any word in the RAM can be written while any other word is being read. Simultaneous write and read of the same word, of course, must be avoided.
Designers can configure the width and depth of a FIFO memory to exactly fit an application. Does a communications application require 9-, 18-, or 36-bit words, or some other width, to accommodate parity or error correction? If so, embedded RAMs provide a variety of configurations, such as 512 by 2, 256 by 4, 128 by 9, and 64 by 18. If an application requires flags beyond FULL and EMPTY, such as HALF_FULL, ALMOST_FULL, or FULL_MINUS_7, the designer can customize an IP-core-generated FIFO by editing the HDL code to fit specific needs.
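As a rough illustration of fitting width and depth, the sketch below counts how many RAM blocks a FIFO would need if identical blocks are tiled side by side for width and stacked for depth. Only the four block configurations come from the text; the tiling rule and the names `blocks_needed` and `cheapest_config` are assumptions made for illustration, not a description of any vendor tool.

```python
from math import ceil

# Block configurations listed in the text, as (depth_words, width_bits).
CONFIGS = [(512, 2), (256, 4), (128, 9), (64, 18)]

def blocks_needed(fifo_width, fifo_depth, config):
    """Blocks required when identical blocks are tiled (assumed rule)."""
    depth, width = config
    return ceil(fifo_width / width) * ceil(fifo_depth / depth)

def cheapest_config(fifo_width, fifo_depth):
    """Return the (config, block_count) pair that uses the fewest blocks."""
    candidates = ((c, blocks_needed(fifo_width, fifo_depth, c)) for c in CONFIGS)
    return min(candidates, key=lambda pair: pair[1])

if __name__ == "__main__":
    for width, depth in [(9, 128), (18, 64), (36, 64)]:
        cfg, count = cheapest_config(width, depth)
        print(f"{width}-bit by {depth}-word FIFO: "
              f"{count} block(s) configured {cfg[0]} by {cfg[1]}")
```

A real fitter would also account for routing and for blocks reserved by other logic, which this sketch ignores.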
The basic synchronous FIFO design uses dual-port RAM for independent PUSH and POP operations with a common clock. Write data is pushed in the clock period in which PUSH is asserted. Read data is popped in the clock period in which POP is asserted. The FIFO can be cascaded by connecting the NOT_EMPTY line to PUSH and NOT_FULL to POP of adjacent FIFOs. Custom flags such as FULL_MINUS_7 can be added with a comparator and a write-side counter initialized to seven.
When the FIFO isn't FULL and PUSH indicates that write data is present, data will be pushed into the write side of the dual-port RAM at the write address pointer, while incrementing the pointer. When the write pointer catches up with the read pointer, the FULL flag is set. It remains so until POP is asserted.
When the FIFO isn't EMPTY and POP indicates that read data is requested, data will be popped from the read side of the dual-port RAM at the read address pointer while incrementing the pointer. When the read pointer catches up with the write pointer, the EMPTY flag is set and remains set until PUSH is asserted.
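The push and pop rules just described can be sketched as a small behavioral model. This is a software illustration only, not HDL and not the author's implementation; the class name `SyncFifo` and its `clock` method are hypothetical, and a Python list stands in for the dual-port RAM.

```python
class SyncFifo:
    """Behavioral model of a single-clock FIFO built on a dual-port RAM."""

    def __init__(self, depth):
        self.depth = depth
        self.ram = [None] * depth   # stands in for the dual-port RAM
        self.wr_ptr = 0             # write address pointer
        self.rd_ptr = 0             # read address pointer
        self.full = False
        self.empty = True

    def clock(self, push=False, pop=False, wdata=None):
        """Advance one clock period; return the popped word, or None."""
        do_push = push and not self.full
        do_pop = pop and not self.empty
        rdata = None
        if do_push:
            self.ram[self.wr_ptr] = wdata
            self.wr_ptr = (self.wr_ptr + 1) % self.depth
        if do_pop:
            rdata = self.ram[self.rd_ptr]
            self.rd_ptr = (self.rd_ptr + 1) % self.depth
        if do_push and not do_pop:
            # Write pointer caught up with the read pointer: FIFO is full.
            self.empty = False
            self.full = self.wr_ptr == self.rd_ptr
        elif do_pop and not do_push:
            # Read pointer caught up with the write pointer: FIFO is empty.
            self.full = False
            self.empty = self.rd_ptr == self.wr_ptr
        # A simultaneous push and pop leaves the fill level and flags unchanged.
        return rdata

if __name__ == "__main__":
    fifo = SyncFifo(depth=4)
    for word in ("a", "b", "c", "d"):
        fifo.clock(push=True, wdata=word)
    print("FULL after four pushes:", fifo.full)
    print("first word popped:", fifo.clock(pop=True))
```

Because a push is refused when FULL and a pop is refused when EMPTY, the two pointers never address the same word in the same cycle, which is the condition the dual-port RAM requires.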
This synchronous FIFO can be modified for asynchronous operation. First, separate the single clock into write- and read-side clocks: WCLK and RCLK (Fig. 1). Next, the hold conditions for FULL and EMPTY must be changed to a comparison of the write and read pointers for equality. While these pointers are equal, FULL or EMPTY must stay set (whichever was set). Using NOT_POP to hold FULL or NOT_PUSH to hold EMPTY, as in the synchronous design, will not work across the asynchronous boundary.
Since the write-side and read-side state-machine dependencies are purely combinatorial, the MTBF of the asynchronous FIFO is set by the metastability characteristic of a single master-slave flip-flop. FULL and EMPTY will occasionally assert and fall back according to that characteristic. The false assertions caused by decode spikes at binary-counter rollover can be reduced by using Gray-code counters. Even though the Gray-code sequence allows only 1-bit transitions, occasional metastable events will still occur, but at the minimum rate.
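The single-bit-transition property of the reflected binary (Gray) code can be checked with a short sketch. The function names are illustrative, and the 5-bit pointer width is an assumption chosen to match a 32-word FIFO.

```python
def bin_to_gray(n):
    """Convert a binary count to its reflected-binary (Gray) code."""
    return n ^ (n >> 1)

def gray_to_bin(g):
    """Convert a Gray-code value back to a binary count."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

def bits_changed(a, b):
    """Number of bit positions that differ between two values."""
    return bin(a ^ b).count("1")

if __name__ == "__main__":
    depth = 32  # 5-bit pointer, as for a 32-word FIFO (assumed size)
    for count in (15, 31):
        nxt = (count + 1) % depth
        print(f"{count} -> {nxt}: binary changes "
              f"{bits_changed(count, nxt)} bit(s), Gray changes "
              f"{bits_changed(bin_to_gray(count), bin_to_gray(nxt))} bit(s)")
```

At rollover points a binary counter changes every bit at once, so a comparator sampling mid-transition can briefly see an unrelated address. The Gray sequence changes one bit per step, including the wrap from the last count back to zero.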
The strategy employed in this FIFO design maximizes MTBF. It isolates the write and read state machines as two separate synchronous systems that communicate over the "asynchronous boundary" with the minimum number of combinatorial signals. A description of the write and read state machines follows.
When the FIFO memory isn't FULL and PUSH indicates that write data is present, data will be pushed into the write side of the dual-port RAM at the write address pointer, while incrementing the pointer. When the write pointer catches up with the read pointer, the FULL flag is set and remains set as long as the read and write addresses are equal.
When the FIFO isn't EMPTY and POP indicates that read data is requested, data will be popped from the read-side of the dual-port RAM at the read address pointer while incrementing the pointer. When the read pointer catches up with the write pointer, the EMPTY flag is set and remains set as long as the read and write addresses are equal.
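The two state machines can be sketched as a behavioral model with one method per clock domain. This is an illustration of the flag rules only: it ignores metastability and synchronizer latency, treats each method call as one clock edge, and uses hypothetical names (`AsyncFifoModel`, `wclk`, `rclk`). It also assumes a depth of at least two words and that both clocks keep running, so that each side gets to observe the pointers while they differ.

```python
class AsyncFifoModel:
    """Flag behavior of a two-clock FIFO; no metastability is modeled."""

    def __init__(self, depth):
        assert depth >= 2, "model assumes at least two words"
        self.depth = depth
        self.ram = [None] * depth   # stands in for the dual-port RAM
        self.wr_ptr = 0
        self.rd_ptr = 0
        self.full = False
        self.empty = True

    def wclk(self, push=False, wdata=None):
        """One WCLK edge on the write side."""
        if push and not self.full:
            self.ram[self.wr_ptr] = wdata
            self.wr_ptr = (self.wr_ptr + 1) % self.depth
            # Write pointer caught up with the read pointer: set FULL.
            self.full = self.wr_ptr == self.rd_ptr
        else:
            # Hold FULL only while the two addresses remain equal.
            self.full = self.full and self.wr_ptr == self.rd_ptr

    def rclk(self, pop=False):
        """One RCLK edge on the read side; return the popped word, or None."""
        rdata = None
        if pop and not self.empty:
            rdata = self.ram[self.rd_ptr]
            self.rd_ptr = (self.rd_ptr + 1) % self.depth
            # Read pointer caught up with the write pointer: set EMPTY.
            self.empty = self.rd_ptr == self.wr_ptr
        else:
            # Hold EMPTY only while the two addresses remain equal.
            self.empty = self.empty and self.rd_ptr == self.wr_ptr
        return rdata

if __name__ == "__main__":
    fifo = AsyncFifoModel(depth=4)
    fifo.wclk(push=True, wdata="a")
    fifo.rclk()                      # read side sees unequal pointers
    print("EMPTY after one write and one RCLK:", fifo.empty)
    print("popped:", fifo.rclk(pop=True))
```

Note that neither flag depends on the other side's PUSH or POP strobe; each side looks only at its own flag and at the pointer comparison.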
The write-side and read-side state machines can be modeled as a dual synchronizer. This synchronizer is the key to maximizing MTBF by raising slack time at high clock frequencies. Understanding this relationship requires a review of some metastability basics.
Metastability can occur when data changes too close to a clock transition, inside the flip-flop's setup-and-hold window. The system MTBF is limited by the slack time, and is determined by:
MTBF = e^(K2 × t) / (K1 × FCLOCK × FDATA)
where t is the slack time available for settling, K1 and K2 are constants proportional to the flip-flop's gain-bandwidth product characteristic, FCLOCK is the frequency of the synchronizing clock, and FDATA is the frequency of the asynchronous data.
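To show how strongly MTBF depends on slack, here is a minimal sketch that assumes the common exponential form MTBF = e^(K2 × t) / (K1 × FCLOCK × FDATA). The values of K1 and K2 below are placeholders picked for illustration, not figures from any data sheet; real values must come from the flip-flop's characterization data.

```python
import math

# Placeholder constants (assumptions, NOT from any data sheet).
K1 = 1.0e-10   # seconds
K2 = 2.0e10    # 1/seconds

def mtbf_seconds(slack_s, f_clock_hz, f_data_hz, k1=K1, k2=K2):
    """MTBF in seconds for a given settling (slack) time in seconds."""
    return math.exp(k2 * slack_s) / (k1 * f_clock_hz * f_data_hz)

def slack_for_mtbf(target_s, f_clock_hz, f_data_hz, k1=K1, k2=K2):
    """Slack in seconds needed to reach a target MTBF (inverse of the above)."""
    return math.log(target_s * k1 * f_clock_hz * f_data_hz) / k2

if __name__ == "__main__":
    f_clock, f_data = 100e6, 10e6   # example frequencies (assumed)
    for slack_ns in (1, 2, 3):
        mtbf = mtbf_seconds(slack_ns * 1e-9, f_clock, f_data)
        print(f"slack {slack_ns} ns -> MTBF {mtbf:.3g} s")
    need = slack_for_mtbf(1.0e9, f_clock, f_data)
    print(f"slack for a 1.0e9 s MTBF: {need * 1e9:.2f} ns")
```

Because slack sits in the exponent, each added nanosecond multiplies the MTBF by a constant factor of e^(K2 × 1 ns), while doubling either frequency only halves it.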
The dual-synchronizer design employs two flip-flops in series to capture the asynchronous event (Fig. 2). The first flip-flop resolves the event after the clock-to-out, metastable, and routing delays. The second flip-flop pipelines the event after the setup (tSU) and clock-skew (tSKEW) delays. The available time to resolve the metastable condition is the margin, or slack in the clock period, defined as:
Slack = Clock period - Path delay
where the path delay is the sum of the clock-to-out delay, routing delays, tSU, and tSKEW.
The goal is to have enough slack time to push the MTBF beyond 1.0 × 10^9 s, or roughly 32 years.
The second flip-flop serves two critical functions in maximizing the MTBF. First, fan-out delays for the state machine are pushed to the next clock cycle, freeing up valuable slack time. Second, a single metastable detector resolves the event to one legal state, whereas multiple detectors could each interpret the event differently.
The second flip-flop operates synchronously with the FIFO state-machine logic (write side or read side) within the bounds of the MTBF prescribed by the first synchronizing flip-flop. The following path-analysis results were obtained for a 32-word by 32-bit asynchronous FIFO test design:
Almost Full to Full = 3 ns
Almost Empty to Empty = 3 ns
Slack = 7 ns at 100 MHz
Thus, the worst-case delay, 3 ns, allows a 7-ns slack at 100 MHz. This is long enough to provide an acceptable MTBF.
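The slack figure follows directly from Slack = Clock period - Path delay. The short sketch below repeats that arithmetic with the reported numbers (a 100-MHz clock and a 3-ns worst-case path); the function names are illustrative.

```python
def clock_period_ns(f_clock_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / f_clock_mhz

def slack_ns(f_clock_mhz, path_delay_ns):
    """Slack = clock period - path delay, all in nanoseconds."""
    return clock_period_ns(f_clock_mhz) - path_delay_ns

if __name__ == "__main__":
    worst_path_ns = 3.0  # Almost Full to Full, Almost Empty to Empty
    # At 100 MHz the period is 10 ns, which leaves the reported 7 ns of slack.
    print(f"slack at 100 MHz: {slack_ns(100.0, worst_path_ns)} ns")
    # The same path leaves less slack as the clock speeds up.
    print(f"slack at 200 MHz: {slack_ns(200.0, worst_path_ns)} ns")
```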
Embedding the asynchronous FIFO in a QuickPCI device enables design verification. The PCI interface supplies data to the write side of the FIFO memory (Fig. 3). The read side of the FIFO is clocked out to a local bus by a variable clock that is asynchronous to the PCI clock. The local-bus data is looped back to another FIFO that outputs back to the PCI interface.
A demo board is available to provide the variable clock, local bus, and PCI bus to a PC that's running Windows NT 4.0. Available software writes a stream of test data to the FIFOs, reads the data back, and compares the sent and received data. While this is happening, the clock frequency increases from 1 to 100 MHz under software control. No errors were detected in the 32-word by 32-bit asynchronous FIFO that was tested.
In a real system, such as an Ethernet-to-PCI interface, asynchronous FIFOs smooth the flow of data between Ethernet and PCI system clocks (Fig. 4). Data is transferred from system memory, through the FIFOs, to the Ethernet controller in burst mode using the PCI master-mode DMA. The FIFOs handle data-burst-rate mismatches by holding data and delivering on demand, while synchronizing to both the PCI and Ethernet clocks.
The PCI master/slave controller can be implemented by the QL5032 using the PCI burst transfer mode for zero-wait-state transfers at up to PCI's maximum of 132 Mbytes/s. The QL5032, a member of the QuickLogic QuickPCI Embedded Standard Product (ESP) family, provides a complete and customizable PCI interface, plus 12,000 gates of programmable logic.
The programmable-logic portion of the device contains over 300 logic cells and 14 dual-port RAM blocks. These configurable RAM blocks implement 32-bit FIFOs up to 384 words deep, large enough to hold one 1526-byte Ethernet frame.
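The frame-fit claim can be checked with simple arithmetic. The sketch below uses only the figures quoted above (384 words, 32 bits per word, a 1526-byte frame); the helper names are illustrative.

```python
def fifo_capacity_bytes(words, width_bits):
    """Total storage of a FIFO in bytes."""
    return words * width_bits // 8

def words_for_frame(frame_bytes, width_bits):
    """Words needed to hold a frame, rounding up to a whole word."""
    bytes_per_word = width_bits // 8
    return -(-frame_bytes // bytes_per_word)  # ceiling division

ETHERNET_FRAME_BYTES = 1526  # frame size quoted in the text

if __name__ == "__main__":
    capacity = fifo_capacity_bytes(384, 32)
    needed = words_for_frame(ETHERNET_FRAME_BYTES, 32)
    print(f"capacity: {capacity} bytes, frame needs {needed} words")
    print(f"spare bytes: {capacity - ETHERNET_FRAME_BYTES}")
```

The margin is small: a 384-word FIFO holds one such frame with only a few words to spare.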
The QL5032 meets PCI 2.1 electrical and timing specifications, and supports the Win '98 and PC '98 standards. The device runs on 3.3 V, with multivolt-compatible I/Os. So, it can operate in 3.3-V-only systems, as well as mixed 3.3-/5-V systems.
Designers should take away three things from this report:
- Dual-port RAM is the best type of embedded memory for FIFO applications.
- A design with a single synchronizer and Full/Empty flags with Gray-code pointers is suitable for a low-speed, asynchronous FIFO.
- A dual synchronizer with Almost Full/Empty flags and Gray-code pointers is the best choice to ensure the maximum MTBF in an asynchronous FIFO application.
For more information on metastable states, point your browser to http://rk.gsfc.nasa.gov/fpgas.htm. It features a discussion of metastable states, sample calculations with an MTBF calculator, and a reference list.