Now that the data clock has reached the receiving device, maximizing the loop delay through the GTLP receiver and the additional delay element determines the best possible data rate, given all the worst-case numbers for propagation delay. In the example shown in Figure 4, the system variables together consume 14.5 ns, which works out to a clock frequency of about 69 MHz (Table 2). Clocking data through the interface at any higher frequency cannot guarantee enough time to register valid data before the incoming data starts to change state at the receiver inputs.
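The conversion from a worst-case timing budget to a maximum clock frequency is simple arithmetic, sketched below. Only the 14.5 ns total comes from the text; the function name is illustrative, and the individual contributions listed in Table 2 are not reproduced here.

```python
def max_clock_mhz(budget_ns: float) -> float:
    """Highest clock frequency (MHz) whose period still covers the timing budget."""
    if budget_ns <= 0:
        raise ValueError("timing budget must be positive")
    # A period of 1 ns corresponds to 1000 MHz, so f [MHz] = 1000 / T [ns].
    return 1000.0 / budget_ns


# 14.5 ns is the total worst-case budget quoted for the Figure 4 example.
total_budget_ns = 14.5
print(f"{max_clock_mhz(total_budget_ns):.1f} MHz")
```

Any variable removed from the budget shortens the required period and raises this ceiling, which is the motivation for attacking the individual contributions one at a time.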
Sending a single clock with the data has improved the interface clock frequency by 20 MHz, or 40% in terms of actual bandwidth or throughput. While 40% is a terrific gain for the small amount of work required to reconfigure the interface, there's still room for improvement. One of the variables in the previous solution is device-to-device skew. Sending a private clock signal from each GTLP interface device eliminates that variable and shortens the worst-case loop delay needed to clock or register the incoming data in the GTLP interface device. The design tradeoff is that better overall system performance requires more interface clocks and the routing overhead of additional clock lines.
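As a rough cross-check of those figures, the sketch below backs out the starting frequency from the two numbers given in the text (about 69 MHz after the change, a 20 MHz improvement). The resulting baseline of roughly 49 MHz is an inference from those rounded values, not a number stated here, and the function names are illustrative.

```python
def implied_baseline_mhz(new_mhz: float, gain_mhz: float) -> float:
    """Starting frequency implied by the new frequency and the absolute gain."""
    return new_mhz - gain_mhz


def percent_gain(new_mhz: float, old_mhz: float) -> float:
    """Relative improvement of new_mhz over old_mhz, in percent."""
    return 100.0 * (new_mhz - old_mhz) / old_mhz


new_mhz = 69.0    # rounded figure from the text
gain_mhz = 20.0   # rounded figure from the text
baseline = implied_baseline_mhz(new_mhz, gain_mhz)  # inferred, not quoted
print(f"baseline ~{baseline:.0f} MHz, gain ~{percent_gain(new_mhz, baseline):.1f}%")
```

With the rounded inputs the gain comes out slightly above 40%, which is consistent with the figure quoted above.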
By eliminating performance-limiting variables and minimizing the loopback delay (the CLKOUT-to-CLKIN-to-CLKBA path) associated with the GTLP17T616, an optimal solution can be achieved. Eliminating variables tightens the constraints on the clock-data relationship, allowing higher bandwidth across the same parallel interface. The key variable eliminated is the device-to-device skew. Removing skew from the data rate calculation allows the maximum CLKIN-to-CLKBA delay to be reduced by 4.25 ns.
Looking back to the original synchronous interface and comparing the results shows a 120% bandwidth enhancement over the traditional design. Considering that we're using the same backplane, similar products, and a new architecture, the results are impressive. Nothing in life is free, and this performance has a price. We added eight clock lines, or if you look at it another way, we took away eight data lines. The source-synchronous nature of the backplane signals dictates that the data must now be retimed to either a master system clock or an on-card clock. The basic retiming will probably take place in a system ASIC, generally consisting of at least a register and most likely a synchronous FIFO. Either of these solutions adds to the latency of information across the backplane interface, but doesn't affect the overall bandwidth of the interface.
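The distinction between latency and bandwidth in the retiming stage can be illustrated with a small behavioral model. This is only a sketch: it treats the retiming register or FIFO as a fixed number of pipeline stages clocked once per word, and the function name and the stage count of three are arbitrary assumptions rather than properties of any particular ASIC.

```python
from collections import deque


def retime(words, latency_cycles):
    """Pass one word per clock through `latency_cycles` register stages.

    Returns a list of (cycle, word) pairs observed at the output.
    """
    pipe = deque([None] * latency_cycles)  # empty stages hold None
    # Extra idle cycles at the end flush the words still in flight.
    stream = list(words) + [None] * latency_cycles
    out = []
    for cycle, word in enumerate(stream):
        pipe.append(word)         # new word enters the first stage
        emerged = pipe.popleft()  # oldest stage drives the output
        if emerged is not None:
            out.append((cycle, emerged))
    return out


# Eight words through a hypothetical three-stage retiming path.
for cycle, word in retime(range(8), latency_cycles=3):
    print(cycle, word)
```

In this model the first word appears three cycles late, but after that one word emerges on every clock, so the added stages cost latency and not throughput.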
To put this performance into perspective, consider that current PCI specifications at 66 MHz allow four to five slots. In addition, a PCI-X proposal allows only three slots. The waveforms in Figure 6 are on a fully loaded eight-slot CompactPCI backplane. At nearly twice the frequency and twice the number of slots, this source-synchronous design offers a serious performance advantage.
The source-synchronous architecture is an ideal upgrade from traditional synchronous interface design. It improves the throughput of many passive backplanes. All interface architectures are bound to have tradeoffs, but the positives resulting from a source-synchronous design can far outweigh the negatives.
First, as timing budgets tighten, designers are moving toward this type of architecture in many application areas. Also, master clock skew requirements have been eased. This may allow for some cost savings in the clock distribution subsystem.
Source-synchronous architectures do add a FIFO or resynchronization requirement to system ASICs. In return, signal flight time is no longer a performance-limiting factor, and hold-time margins are easier to meet.
In the future, source-synchronous designs will use state-of-the-art differential signaling techniques to further extend the capabilities of passive parallel backplane interfaces.