[Leapfrog: First Look]
1.5-GHz FPGA Takes Clock Gating To The Max
William Wong
ED Online ID #19952
November 7, 2008
Copyright © 2006 Penton Media, Inc., All rights reserved.
Flexibility is key to FPGA success, but speed is equally important. Achronix almost triples FPGA throughput by taking clock gating to the extreme. The Achronix Speedster FPGAs use a unique pipeline architecture but completely hide it from developers. Designers can target the devices with unaltered RTL written in Verilog or VHDL, and they can continue to use development tools like Synplicity’s Synplify-Pro and Mentor Graphics’ Precision.
Speedster’s overall architecture (Fig. 1) and specs (see the table) look like those of most FPGAs. It is RAM-based and built around four-input lookup tables (LUTs). It also has the usual complement of I/O interfaces, including high-speed serializer/deserializers (SERDES) and memory controllers. Most importantly, picoPIPE elements, which no other FPGA has, are sprinkled throughout the interconnect fabric (Fig. 2).
The picoPIPE elements change the way things work within Speedster. In a conventional FPGA, LUTs are connected together and data flows from one latch to another. The latch clocks are normally synchronized, and clock distribution and synchronization are major limitations in conventional FPGAs, especially when pushing the performance boundaries of the system.
System clocking must account for the delay through the LUTs, so the clock rate is limited by the maximum delay through the longest chain of LUTs. In sample #1, that delay is three LUTs. Achronix takes a different approach by placing a picoPIPE between each stage.
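To make the clock-rate argument concrete, here is a minimal Python sketch. It assumes a uniform, hypothetical LUT delay of 0.5 ns and ignores latch, picoPIPE, and routing overhead, so it illustrates only the ratio and not real Speedster timing. The function names are illustrative. `conventional_period_ns` uses the longest end-to-end chain, which is the conventional limit described above. `pipelined_period_ns` uses the slowest single LUT, which is the limit the article attributes to placing a pipeline stage after every LUT.

```python
# Toy timing model. Each chain is a list of per-LUT delays in nanoseconds.
# The 0.5 ns LUT delay used below is an illustrative assumption, not an
# Achronix figure, and latch/picoPIPE/routing overhead is ignored.


def conventional_period_ns(chains):
    """Latches only at the ends of each chain: the slowest full chain sets the period."""
    return max(sum(chain) for chain in chains)


def pipelined_period_ns(chains):
    """A pipeline stage after every LUT: the slowest single LUT sets the period."""
    return max(max(chain) for chain in chains)


def to_mhz(period_ns):
    """Convert a clock period in ns to a frequency in MHz."""
    return 1000.0 / period_ns


if __name__ == "__main__":
    lut_ns = 0.5
    chains = [[lut_ns] * 3]  # sample #1: a chain of three LUTs
    for label, period in (("conventional", conventional_period_ns(chains)),
                          ("pipelined", pipelined_period_ns(chains))):
        print(f"{label:>12}: {period:.2f} ns -> {to_mhz(period):.0f} MHz")
```

With three equal LUTs, the pipelined period is one third of the conventional one. Real per-stage overhead would reduce that gain.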
In example #2, the Speedster’s clock rate can be increased by a factor of four because the queue holds four states. The shortest chain limits the maximum number of states a subsystem can contain. If #2, #3, and #4 are used in a design, then #4 is the limiting factor with only two states. If only #2 and #3 are used, the subsystem can handle up to four states.
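The capacity rule in the paragraph above reduces to taking a minimum. The sketch below describes each chain only by its number of picoPIPE stages. The counts for #2 (four) and #4 (two) come from the text. The count for #3 is an inference: the text says #2 and #3 together allow four states, and that #3 is padded to match #2. The function name is hypothetical.

```python
# Toy capacity model: a chain is described only by its number of picoPIPE
# stages. Counts for #2 and #4 come from the article; the count for #3 is
# inferred from the statement that #2 and #3 together allow four states.


def subsystem_capacity(stage_counts):
    """Most data items the subsystem can hold: the chain with the fewest stages limits it."""
    if not stage_counts:
        raise ValueError("subsystem needs at least one chain")
    return min(stage_counts.values())


CHAINS = {"#2": 4, "#3": 4, "#4": 2}

if __name__ == "__main__":
    print(subsystem_capacity(CHAINS))                                # 2
    print(subsystem_capacity({k: CHAINS[k] for k in ("#2", "#3")}))  # 4
```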
The picoPIPEs operate asynchronously. This is significant because it eliminates clock distribution and synchronization problems: clocks are used only with the latches, and the source and destination clocks do not have to be synchronized. They do need to run at the same rate, though.
STEP BY STEP
Following data through the system helps explain how things work. The first piece of data (1) enters the system when the left-most set of latches is clocked. In the conventional FPGA, the data propagates through the LUTs and is available at the other latch when the next clock cycle occurs. The LUT delay limits the clock rate, as already noted.
With Speedster, the first data item
will run through the picoPIPE FIFOs
until it gets to the other end of the system.
It is removed when the latch on
the right side is clocked.
The delay through the LUTs is the
same, but there can be more than one
piece of data within the subsystem.
If a piece of data essentially “bumps” into the piece ahead of it, it remains in its current picoPIPE stage until that data moves out of the next stage.
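The “bump” behavior can be mimicked with a small tick-based model. This is only a software analogy under stated assumptions. Real picoPIPE stages use asynchronous handshakes rather than discrete ticks, and the function names, the four-stage depth, and the slower right-hand latch (clocked every second tick) are hypothetical choices for illustration. An item advances only when the next slot is empty. Otherwise it waits where it is.

```python
from collections import deque

# Toy, tick-based model of an elastic pipeline. Real picoPIPE stages use
# asynchronous handshakes rather than ticks; this only mimics the behavior.


def step(stages, source, sink, consume):
    """Advance the pipeline by one tick.

    stages  -- list of slots (None = empty); index 0 is the input side
    source  -- deque of items waiting at the left-hand latch
    sink    -- list of items removed by the right-hand latch
    consume -- True if the right-hand latch is clocked on this tick
    """
    # The right-hand latch removes the item in the last stage, if any.
    if consume and stages[-1] is not None:
        sink.append(stages[-1])
        stages[-1] = None
    # Walk from the output toward the input so a slot vacated this tick can
    # be refilled immediately. An item whose next slot is still occupied
    # "bumps" into it and stays put.
    for i in range(len(stages) - 2, -1, -1):
        if stages[i] is not None and stages[i + 1] is None:
            stages[i + 1], stages[i] = stages[i], None
    # The left-hand latch inserts the next item if the first stage is free.
    if source and stages[0] is None:
        stages[0] = source.popleft()


def run(items, n_stages=4, ticks=20, consume_every=2):
    """Feed items through the pipeline; the output side is clocked less often."""
    stages = [None] * n_stages
    source = deque(items)
    sink = []
    for t in range(ticks):
        step(stages, source, sink, consume=(t % consume_every == consume_every - 1))
        print(t, stages)
    return sink, stages


if __name__ == "__main__":
    run([1, 2, 3, 4, 5])
```

When run, the pipeline fills to `[4, 3, 2, 1]` by tick 3. It then stalls on ticks when the right-hand latch is not clocked, and the items still leave in order.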
This approach permits paths of different lengths, such as #2 and #4, to operate within the same subsystem. But the number of items within the subsystem is limited by the smallest number of picoPIPEs in any one chain of computation. Empty stages without LUTs can be inserted, as in #3, so that a chain’s FIFO length matches another chain’s (#2).
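Padding a short chain with LUT-free stages raises the minimum stage count, and with it the subsystem’s capacity. Here is a minimal sketch of that bookkeeping. The chain names `a` and `b`, the stage labels, and both function names are hypothetical. Because only the stage count matters for capacity, the empty stages are simply appended rather than placed at specific positions, as a real tool might place them.

```python
# Toy model of balancing chains with LUT-free picoPIPE stages. Chain names
# and stage labels are hypothetical; only the stage count matters here, so
# padding is simply appended rather than placed at a specific position.


def pad_chains(chains, target=None):
    """Return a copy of chains with "empty" stages appended up to target length.

    chains -- dict mapping a chain name to its list of stage labels
    target -- desired stage count; defaults to the longest chain
    """
    if target is None:
        target = max(len(stages) for stages in chains.values())
    padded = {}
    for name, stages in chains.items():
        if len(stages) > target:
            raise ValueError(f"chain {name} already exceeds {target} stages")
        padded[name] = stages + ["empty"] * (target - len(stages))
    return padded


def capacity(chains):
    """Items the subsystem can hold: limited by the shortest chain."""
    return min(len(stages) for stages in chains.values())


if __name__ == "__main__":
    chains = {"a": ["LUT"] * 4, "b": ["LUT", "LUT"]}
    print(capacity(chains))              # 2
    balanced = pad_chains(chains)
    print(balanced["b"])                 # ['LUT', 'LUT', 'empty', 'empty']
    print(capacity(balanced))            # 4
```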
If all three sample chains (#2, #3, and #4) are used, only two pieces of data can be in the subsystem at a time. If more are pushed in, data will be lost, as in a typical FIFO architecture. Also, the clock rate of the system is now limited by the delay of a single LUT, not the overall chain. This gives the Speedster its high-throughput characteristics.
Achronix effectively hides the picoPIPEs from the development process except for optimization and tuning. The place-and-route system automatically allocates picoPIPE elements. The developer gets the same kind of throughput information that typical FPGA place-and-route software provides, but Achronix additionally reports picoPIPE usage.
BASED ON ECLIPSE
The Achronix CAD Environment is an Eclipse-based tool. It provides advanced place-and-route, timing analysis, and critical-path analysis, and it lets developers tune the use of picoPIPE stages. As with FPGAs in general, the place-and-route software has a limited number of resources and routes to work with, so utilization does not reach 100% even when a design hits the limits of the hardware. That is one reason why there are so many picoPIPE elements on a chip. Tuning picoPIPE usage tends to be the extent of picoPIPE exposure to developers.
It will be interesting to see whether Achronix exposes more of the picoPIPE architecture in the future, since the self-clocking FIFO architecture offers significant design possibilities.
Achronix has taken clock gating to
the extreme without the problem of
synchronization. This technology is a
game changer. These FPGAs aren’t for
everyone. When it comes to pushing
the envelope, though, Speedster looks
to beat even ASICs.
Pricing for the Speedster starts at
$200. The SPD60 will be the first one
available. A development kit provides
access to the platform.
ACHRONIX • www.achronix.com