
Understanding PCI-Bus Subtleties Optimizes System Performance

For more information on this topic, see “The Never-Ending Quest For Performance.”

The PCI bus offers a speedy data-transfer mechanism, one that far surpasses the ISA bus, if implemented properly. Because PCI originated as a signal-level hardware specification, board vendors are free to devise their own implementations.

As a result, you could mistakenly assume that all PCI data cards achieve comparable functionality. But especially when digitization rates reach into the megahertz range, you must investigate the subtleties among the various schemes before selecting the product best suited for a given application or when optimizing software to squeeze maximum performance from the hardware on hand.

In their rush to market, many manufacturers ported ISA designs to the PCI bus. While these products run fine under DOS or Windows 3.1, they exhibit poor performance under a true multitasking operating system (OS) such as Microsoft Windows NT. These sophisticated environments demand a reengineered design optimized for the PCI bus.

This urgency for redesigns doesn’t arise for many types of peripherals such as disk controllers, video cards, or network adapters. These devices can handle a pause or interruption of data flow, and some can even invalidate a block of data and request a retransmission.

The effects are minimal: extra microseconds for a disk write, an imperceptible slowdown in networking, or a momentary jerky display. In contrast, a high-speed data process can’t tolerate the slightest interruption in data flow without incurring a loss of samples and data corruption.

PCI: Unfulfilled Promises?

Ever since the days of DOS and Windows 3.1, developers of data acquisition boards have sought ways to overcome performance limitations inherent in ISA bus systems (see sidebar). Theoretical peak bus throughput is only 2M transfers/s, but in practical applications, you couldn’t expect much more than 500k transfers/s.

Manufacturers and users alike were delighted with the arrival of the PCI bus and its promises of significant performance enhancements. Unfortunately, data acquisition boards vary widely in the degree to which they exploit this bus. For instance, although data acquisition drivers could theoretically pass two 16-bit samples with each cycle on this 32-bit bus, most don’t. Vendors want to maintain compatibility in their designs with the 16-bit ISA bus, or they simply don’t want to rewrite 16-bit legacy code.
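
As a rough illustration of what that packing buys, the sketch below (plain Python, with an assumed low-word-first sample order) shows how a driver could merge two 16-bit samples into each 32-bit bus transfer, halving the number of cycles:

```python
def pack_samples(samples):
    # Merge pairs of 16-bit samples into 32-bit words so that each PCI
    # transfer carries two samples. The word layout (first sample in the
    # low half) is an assumption for illustration, not a standard.
    words = []
    for i in range(0, len(samples) - 1, 2):
        words.append((samples[i] & 0xFFFF) | ((samples[i + 1] & 0xFFFF) << 16))
    return words

# Four 16-bit samples now need only two 32-bit transfers
print([hex(w) for w in pack_samples([0x1111, 0x2222, 0x3333, 0x4444])])
# ['0x22221111', '0x44443333']
```

A driver that skips this step moves only 16 useful bits per 32-bit bus cycle.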

Another key aspect concerns the direct memory access (DMA) controller that, for the PCI bus, no longer resides on the motherboard. Instead, each board must provide its own bus-master controller that takes over for a CPU-free data transfer (see Figure 1 in the February 2000 issue). But even though the CPU is free for other tasks, it’s unwise to tie up the system bus for too long, so the need for buffering remains as in the ISA days.

Improper use of the bus also can have drastic consequences. While ISA requires four bus cycles for each data transfer, PCI can transfer a value on every clock pulse in Burst mode (two cycles for continuous operation), assuming the I/O board’s hardware is up to the task. Scrutinize a board’s DMA implementation carefully. Did the manufacturer purchase dedicated chips, design custom devices, or work with standard devices such as digital signal processing (DSP) chips that incorporate a PCI interface?

For instance, a board that contains a powerful commercial bus chip may be touted as being bus-master-compatible. Unsaid is the fact that the manufacturer didn’t implement bus mastering in the driver, instead opting for the Repeat String operations it used for ISA cards and for which its driver is best suited.

As for the second option, it’s possible to find PCI boards whose custom logic runs at 20 MHz or slower. Although the PCI bus runs at 33 MHz, the board can feed continuous data only at 10 MHz, which induces Wait states. So while custom logic might add some interesting capabilities to a data acquisition board, it also can degrade overall system performance.
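
Back-of-the-envelope arithmetic shows the cost. Assuming, for illustration, that a burst transfer stalls until the board's logic has the next word ready (a simplified model, not the full PCI protocol):

```python
import math

def wait_states_per_transfer(bus_mhz, feed_mhz):
    # Each word takes roughly bus_mhz/feed_mhz bus clocks to become
    # available; in Burst mode only one of those clocks moves data,
    # and the remainder are wasted as wait states.
    cycles_per_word = math.ceil(bus_mhz / feed_mhz)
    return cycles_per_word - 1

# A 33-MHz bus fed at only 10 MHz: each word occupies ~4 bus clocks,
# 3 of them wasted as wait states
print(wait_states_per_transfer(33, 10))  # 3
```
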

Yet another design issue concerns the concept of interrupt chaining introduced by the PCI bus. It’s not unusual for several PCI devices to share the same interrupt line, in which case the OS must sequence through several interrupt service routines (ISRs) to find the right one.

PCI interrupt logic also behaves differently from simple ISA interrupt logic, and merely porting a design to PCI can hamper execution. For instance, with logic that runs at rates far slower than the bus, an I/O board detects an event such as a Buffer Half Full and triggers an interrupt.

The ISR empties the buffer, clears the interrupt, and returns control to the CPU, but because of delays, the logic has insufficient time to reset and the interrupt is still asserted. This situation causes the OS to reschedule and rerun the ISR unnecessarily, because the ISR finds nothing to do. And don’t forget that a multitasking OS such as NT schedules ISRs at a very high priority, which robs the system of execution time that other applications might desperately need.
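
A defensive ISR sidesteps the spurious rerun by checking its own device's status before doing any work, which is also what interrupt chaining requires of every handler on a shared line. The sketch below models that logic in Python; the Device fields are hypothetical, not a real driver API:

```python
class Device:
    def __init__(self):
        self.interrupt_pending = False
        self.buffer = []

def isr(device):
    # On a shared PCI interrupt line, first check whether this device
    # actually asserted the interrupt; if not, decline quickly so the
    # OS can chain to the next handler instead of doing pointless work.
    if not device.interrupt_pending:
        return False
    device.buffer.clear()              # the real work: drain the buffer
    device.interrupt_pending = False   # clear the interrupt at its source
    return True

d = Device()
d.buffer, d.interrupt_pending = [10, 20, 30], True
print(isr(d), isr(d))  # True False -- the rerun finds nothing to do
```
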

Avoiding Overruns

Once the board places data values in a buffer, another issue concerns how an application accesses them without risking an overrun of new data from the I/O board. NT gives the highest priority to the foreground application, so if the data acquisition task runs in the background, make sure a compute-intensive process won’t hold up system operation. You can try adjusting priority settings, but it’s easy to botch them so the entire system runs sluggishly. Also, if you go overboard and inadvertently leave insufficient time for NT’s housekeeping, the OS halts all other operations until it takes care of critical system matters.

One way to alleviate overruns is to boost the number of buffers. In this scheme, a driver fetches a buffer from the Ready queue, loads it with digitized data, and places it on the Done queue. The user application reads data from buffers in the Done queue and places them back on the Ready queue.
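
In outline, that queue scheme looks like the following sketch, in which the buffers are pre-allocated once and merely circulate between the two queues (names and sizes are illustrative):

```python
from collections import deque

ready = deque(bytearray(8) for _ in range(4))  # empty buffers for the driver
done = deque()                                 # filled buffers for the app

def driver_fill(data):
    # Driver side: fetch a buffer from Ready, load it, post it to Done.
    buf = ready.popleft()      # an empty Ready queue would mean an overrun
    buf[:len(data)] = data
    done.append(buf)

def app_consume():
    # Application side: read a filled buffer, then recycle it to Ready.
    buf = done.popleft()
    ready.append(buf)
    return bytes(buf)

driver_fill(b"scan0001")
print(app_consume())  # b'scan0001'
```
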

Keeping track of all the buffers and their status can be a complex task, one in which both the driver and the application must cooperate. In addition, the buffer queue’s greatest advantage, the capability to dynamically allocate and free buffers during the data acquisition process, carries severe performance consequences because allocating memory is a CPU-intensive operation.

Another issue concerns buffer size. Spreading overhead over large blocks of data makes a transfer more efficient, but a user application doesn’t have access to any of the data points until the buffer is filled and made available to the application. This problem becomes most noticeable when you want to present data in a steady view rather than with jerky motions.
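
The tradeoff is easy to quantify: the worst-case display latency is simply the time needed to fill one buffer. A quick calculation with hypothetical rates:

```python
def buffer_latency_ms(buffer_samples, sample_rate_hz):
    # Worst-case wait before the application sees any new data:
    # the time required to fill one buffer completely.
    return 1000.0 * buffer_samples / sample_rate_hz

# At 1 MS/s, a 64k-sample buffer updates the display only every ~65 ms,
# while an 8k-sample buffer refreshes every ~8 ms at higher overhead.
print(buffer_latency_ms(65536, 1_000_000))  # 65.536
print(buffer_latency_ms(8192, 1_000_000))   # 8.192
```
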

Into the Ring

A more creative approach uses an advanced circular buffer (ACB) (Figure 2). When combined with applications that take advantage of this flexible buffering mechanism, the system as a whole runs much more efficiently.

In this scheme, the data acquisition driver allocates a large circular buffer in the application’s memory space. At its largest, the buffer should leave sufficient physical memory for most of the OS and active applications to prevent frequent disk swapping. At its smallest, to ensure continuous, gap-free acquisition, the buffer should be large enough to hold all data that arrives during the largest anticipated pause in user-application processing of buffer data due to multitasking. This also implies that the application must be able to process data at a faster rate than the rate of acquisition.

Once an acquisition is started, the board/driver stores data at the head of the buffer while the application generally reads data from the tail. Both operations occur asynchronously and can run at different rates. However, you can synchronize them by either timer notification or a driver event.

To receive notification on a specific sample or scan-count boundary, the driver segments the buffer into frames. Whenever incoming data crosses a frame boundary, the driver sends an event to the application. If multichannel acquisition is performed, the frame size should be a multiple of the scan size to keep pointer arithmetic from becoming unnecessarily complex.
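
The head/tail and frame-event mechanics can be sketched in a few lines of Python; the class and callback names are assumptions for illustration, not any vendor's actual API:

```python
class ACB:
    def __init__(self, size, frame_size, on_frame):
        assert size % frame_size == 0   # frames must tile the buffer
        self.buf = [0] * size
        self.frame_size = frame_size
        self.head = 0                   # where the driver writes next
        self.tail = 0                   # where the application reads next
        self.on_frame = on_frame        # event delivered to the application

    def write(self, samples):
        # Driver side: store data at the head, wrap at the buffer's end,
        # and signal the application at every frame-boundary crossing.
        for s in samples:
            self.buf[self.head] = s
            self.head = (self.head + 1) % len(self.buf)
            if self.head % self.frame_size == 0:
                self.on_frame(self.head)

events = []
acb = ACB(size=8, frame_size=4, on_frame=events.append)
acb.write(range(6))   # crosses one frame boundary
print(events)         # [4]
```
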

With the ACB, three modes of operation are possible depending on the action taken when the end of the buffer is reached or if the buffer head catches up with the tail.

  • In the single buffer mode, acquisition stops when the driver reaches the buffer’s end. The user application can access the buffer and process data during acquisition or wait until the buffer is full. This approach is appropriate when you’re not acquiring data in a continuous stream, and it resembles the way a digital scope operates.

  • In the circular buffer mode, the head and tail each wrap to the buffer start when they reach the end. If the head catches up to the tail pointer, the buffer is considered full and acquisition stops. This mode is useful in applications that must acquire data with no sample loss. Data acquisition continues until either a predefined trigger condition or the application stops the driver. If the application can’t keep up with the acquisition process and the buffer overflows, the driver halts the acquisition and reports an error condition.

  • The recycled mode resembles the circular buffer mode except that when the head catches up with the tail pointer, it automatically increments the tail to the next frame boundary. As the buffer fills up, the driver is free to recycle frames, automatically incrementing the buffer tail.
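
The three policies differ only in what happens at the buffer's end or when the head catches the tail, which the following sketch makes explicit (a simplified model of the rules above, not driver code):

```python
SINGLE, CIRCULAR, RECYCLED = range(3)

def advance_head(head, tail, size, frame, mode):
    # Advance the head by one sample and apply each mode's policy.
    # Returns (new_head, new_tail, still_running).
    new_head = head + 1
    if mode == SINGLE:
        if new_head == size:
            return head, tail, False     # end of buffer: acquisition stops
        return new_head, tail, True
    new_head %= size                     # circular modes wrap around
    if new_head == tail:                 # head caught up with the tail
        if mode == CIRCULAR:
            return head, tail, False     # overrun: halt and report an error
        tail = (tail + frame) % size     # RECYCLED: drop the oldest frame
    return new_head, tail, True

# Recycled mode keeps running by bumping the tail one frame ahead
print(advance_head(7, 0, 8, 4, RECYCLED))  # (0, 4, True)
```
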

While the ACB might seem a departure from earlier single and double buffer schemes, it’s actually a superset of them. In the single buffer mode, the ACB operates like a single buffer. Configured as a circular buffer with two frames, it behaves like a double buffer.

With multiple frames, the ACB can function in algorithms designed for buffer queues. The only limitation is that the logical buffers in the queues can’t be dynamically allocated or freed and their order is fixed.

As powerful as the ring buffer is, not all vendors support it, and one reason concerns backward compatibility with 16-bit drivers and applications. In fact, one vendor freely admits that its ACB driver won’t run under DOS or Windows 3.1. But losing the capability to run on those older environments is a small price to pay for the significant performance improvement.

Acknowledgement

The author thanks Boris Shajenko, the chief technology officer at United Electronic Industries, Watertown, MA, for his assistance in the preparation of this article.


About the Author

Paul G. Schreier is a marketing consultant in the fields of data acquisition and DSP. He was the founding editor of Personal Engineering & Instrumentation News and previously served as chief editor of EDN magazine. 25 Washington Rd., Rye, NH 03870, (603) 427-1377, e-mail: [email protected].


Published by EE-Evaluation Engineering
All contents © 2000 Nelson Publishing Inc.
No reprint, distribution, or reuse in any medium is permitted
without the express written consent of the publisher.

February 2000
