Our business, campus, and home networks today are full of important traffic being moved from node to node as efficiently as possible. Unfortunately, a significant amount of the traffic in our networks is not what we might consider important. SPAM is consuming a large amount of network bandwidth today, but that’s just one of the many types of traffic creating havoc in our networks.
The “cost” of things like SPAM, worms, or viruses isn’t just the network bandwidth they consume, but also the valuable business time spent dealing with them at the end node. The network bandwidth consumption becomes an issue at the “choke points” that still exist in modern networks, such as access links. With the explosion of MP3 players and downloadable video, peer-to-peer (P2P) traffic has also become a major portion of the data being moved within networks today. A few megabits per second of P2P traffic may not represent the majority of the traffic on a Fast Ethernet or Gigabit Ethernet network, but it certainly can on the DSL, T1, or other access link between that network and the outside world.
In our digital age, there are also numerous mischievous and malicious types of traffic being unleashed on networks today, from computer viruses to malware to phishing e-mails, as well as denial-of-service attacks. As you can see in Figure 1, the financial impact of virus attacks is enormous. This has spurred multi-billion dollar spending on security appliances, application acceleration appliances, and other network appliances aimed at cleaning the traffic on networks, ultimately allowing the consumer to utilize the bandwidth more efficiently.
What all of these appliances have in common is the need to look further up the OSI (Open Systems Interconnection) stack beyond the layer 3 (IP) headers to the application protocol layer. Meanwhile, the needed performance levels are increasing faster than malicious traffic in the networks. Can the general-purpose CPUs that offer these services on networks today continue to deliver the performance necessary to prevent this malicious traffic from entering and propagating throughout networks?
Whether this higher layer processing is called “Application Aware” or “Content Aware,” the fact is that all of these security risks and new low-priority traffic are using the same layer 3, layer 4, and even some layer 7 network protocols as the high-priority traffic. Whether it’s P2P traffic using well-known ports (e.g., port 80) so that it can masquerade as Web traffic and pass through firewalls unhindered, or SPAM coming in over the same SMTP (Simple Mail Transfer Protocol) connection as your much-needed business and personal e-mail, this traffic can’t be differentiated without looking well into the application layer.
Many solutions on the market today can handle this problem, such as regular expression engines using deterministic finite-state automaton (DFA) or nondeterministic finite-state automaton (NFA) algorithms in conjunction with table searches to establish connection-level details. Alternatively, specialized processing can be used to assess each pattern match along with any other information that’s been collected on the connection. However, all of this is being processed on general-purpose CPUs that aren’t optimal for these types of operations.
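To make the DFA approach concrete, here is a minimal Python sketch of a table-driven matcher for a single literal signature. It is only an illustration under simplifying assumptions: real engines compile many regular expressions into one automaton, and the names `build_dfa` and `scan` are hypothetical, not taken from any product. What it shows is the property that makes DFAs attractive for acceleration: exactly one table lookup per input byte, with no backtracking.

```python
from typing import Dict, List, Tuple


def build_dfa(pattern: bytes) -> List[Dict[int, int]]:
    """Build a DFA for one literal pattern (the classic KMP automaton).

    State n means "the last n bytes seen match the first n bytes of the
    pattern". Transitions missing from a state's dict lead back to state 0.
    """
    if not pattern:
        raise ValueError("pattern must be non-empty")
    dfa: List[Dict[int, int]] = [dict() for _ in range(len(pattern) + 1)]
    dfa[0][pattern[0]] = 1
    fallback = 0  # state reached after reading pattern[1:state]
    for state in range(1, len(pattern) + 1):
        # On a mismatch, behave as the fallback state would.
        dfa[state].update(dfa[fallback])
        if state < len(pattern):
            dfa[state][pattern[state]] = state + 1
            fallback = dfa[fallback].get(pattern[state], 0)
    return dfa


def scan(dfa: List[Dict[int, int]], data: bytes, state: int = 0) -> Tuple[List[int], int]:
    """Feed data through the DFA, one lookup per byte.

    Returns the offsets of the last byte of every match, plus the final
    state so a caller can resume scanning with the next chunk of data.
    """
    accept = len(dfa) - 1
    hits = []
    for offset, byte in enumerate(data):
        state = dfa[state].get(byte, 0)
        if state == accept:
            hits.append(offset)
    return hits, state
```

For example, `scan(build_dfa(b"abab"), b"xxababab")` reports matches ending at offsets 5 and 7, including the overlapping one, after touching each input byte once.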
The most difficult traffic to detect involves well-designed attacks that target protocol stacks and their deficiencies. They can cause significant damage to the networks and the data stored on the nodes in the network. Examples of these attacks include a virus that attaches itself to an executable file in an e-mail and causes havoc on your computer, or a worm that uses your own system resources to multiply and spread to other clients. It can even be a Trojan horse: a program that looks like a useful software tool but, when executed, can create a backdoor into your computer, allowing access for malicious activity. Detecting any of these application-level attacks on a network node starts by monitoring each connection to that system from the moment it’s created, to determine the need for further processing.
This monitoring requires tracking the protocol state of all connections and carrying some of that state further along for each connection that’s targeted for deeper application-level processing. Then each connection must be processed on a packet-by-packet basis, looking deeper into the packet for signatures that can indicate certain traffic types, alerting the software to be wary of that connection, or looking for a combination of signatures that leads to detection of the particular attack and preventing further distribution to other network resources. Many of today’s appliances do this with software on general-purpose CPUs. However, they don’t have the necessary performance to process all of the traffic flowing through that network node. To perform this task at the traffic rates of today’s networks, a network node needs hardware acceleration for much of the pattern-matching packet processing. With that work offloaded, the CPU is able to play its part effectively.
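The idea of carrying state along with each connection can be sketched in a few lines of Python. This is a simplified illustration, not how any particular appliance works: it assumes payloads arrive in order (no TCP reassembly), looks for a single literal signature, and uses hypothetical names (`FlowKey`, `ConnectionTracker`). The per-connection state it keeps is just the tail of the previous payload, so a signature split across two packets is still found.

```python
from typing import Dict, NamedTuple, Set


class FlowKey(NamedTuple):
    """Hypothetical 5-tuple identifying one connection."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str


class ConnectionTracker:
    """Scan each connection's payload stream for one literal signature."""

    def __init__(self, signature: bytes) -> None:
        if not signature:
            raise ValueError("signature must be non-empty")
        self.signature = signature
        self.tails: Dict[FlowKey, bytes] = {}   # carried state per connection
        self.flagged: Set[FlowKey] = set()      # connections to be wary of

    def on_packet(self, key: FlowKey, payload: bytes) -> bool:
        """Process one packet; return True if the connection is flagged."""
        window = self.tails.get(key, b"") + payload
        if self.signature in window:
            self.flagged.add(key)
        # Keep only enough bytes to complete a match that straddles packets.
        keep = len(self.signature) - 1
        self.tails[key] = window[-keep:] if keep else b""
        return key in self.flagged

    def on_close(self, key: FlowKey) -> None:
        """Drop the carried state when the connection ends."""
        self.tails.pop(key, None)
        self.flagged.discard(key)
```

Feeding `b"xxEV"` and then `b"ILyy"` on the same flow flags it for the signature `b"EVIL"`, while the same second packet on a different flow does not.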
Hardware Acceleration Can Solve the Problem
Using this targeted hardware acceleration for certain features can increase performance levels from the hundreds of megabits per second to multiple gigabits per second. It also frees up general-purpose CPU cycles, which are better applied to other functions that require greater flexibility.
The targeted hardware is able to better handle the functionality at higher performance levels than software on a CPU. As long as the software can easily utilize the features and capabilities of this hardware, then the increased performance comes at little software-development expense. Such a performance increase allows network systems to process packets at much higher layers on a much higher percentage of the traffic while still maintaining full performance levels. In addition, the greater performance will allow network systems to offer network Quality of Service (QoS), application protocol acceleration, intrusion prevention, anti-virus, malware, and SPAM detection and prevention at many places in the network.
For example, in a typical network, each PC and server has its own anti-virus software. In a 20,000+-person company without central network anti-virus (AV) devices, it can therefore take days if not weeks to update all 20,000+ PCs and servers with the latest operating system patches and new signatures.
Figure 2 offers a different solution. You can add a separate demilitarized zone (DMZ) with a server running AV software. This software would work with the firewall/VPN (virtual private network) and the regular mail server that fill the needs of the client. In the future, you may be able to replace the firewall/VPN with a unified threat management/integrated services router (UTM/ISR) that adds AV support. With this arrangement, it will take only minutes to update the handful of network AV devices (Fig. 3). (Note: DMZ is a term borrowed by the IT industry to refer to a network area that sits between an organization's internal network and an external network, usually the Internet. Usually, externally accessible services are placed in the DMZ. A UTM/ISR is a network security device with all or most of the following functionalities: routing, firewall, VPN, intrusion detection/prevention, anti-virus, anti-SPAM, and content filtering.)
One example of a task that can be offloaded to hardware involves the classification lookups that need to be done to determine what further processing (e.g., pattern matching) should be applied to a particular packet. For instance, Freescale’s MPC8572E PowerQUICC III processor does this using a “Table Look Up” engine, which provides exact match hash lookups for flow classification, and supports Access Control List (ACL) features for more general policy determination. The ACL lookup on the first packet can help identify the possible or likely protocols that might be carried over this connection. Subsequent packets on that connection can be classified using exact match lookups of the IP addresses and layer 4 port numbers to establish the necessary processing for that flow/connection. Software uses the result of the flow-level classification to determine what services and/or higher layer processing (such as pattern matching) are necessary for that particular flow.
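The two-stage lookup described above can be modeled in software as follows. This Python sketch is only a functional illustration of the idea (a slower rule-based ACL lookup on the first packet, then an exact-match hash lookup for the rest of the flow); it does not reflect the MPC8572E's actual interfaces, and every name in it (`AclRule`, `FlowClassifier`, the processing labels) is hypothetical.

```python
from typing import Dict, Iterable, NamedTuple


class FlowKey(NamedTuple):
    """Hypothetical exact-match key: IP addresses, layer 4 ports, protocol."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str


class AclRule(NamedTuple):
    """Simplified ACL rule: protocol plus a destination-port range."""
    proto: str
    port_lo: int
    port_hi: int
    processing: str  # which higher layer processing the flow should get


class FlowClassifier:
    def __init__(self, acl: Iterable[AclRule], default: str = "forward") -> None:
        self.acl = list(acl)
        self.default = default
        self.flows: Dict[FlowKey, str] = {}  # exact-match (hash) flow table
        self.acl_lookups = 0                 # counts slow-path lookups

    def classify(self, key: FlowKey) -> str:
        """Return the processing to apply to a packet of this flow."""
        processing = self.flows.get(key)     # fast path: one hash lookup
        if processing is None:               # first packet of the flow
            self.acl_lookups += 1
            processing = self._acl_lookup(key)
            self.flows[key] = processing
        return processing

    def _acl_lookup(self, key: FlowKey) -> str:
        """First matching rule wins; fall back to the default policy."""
        for rule in self.acl:
            if rule.proto == key.proto and rule.port_lo <= key.dst_port <= rule.port_hi:
                return rule.processing
        return self.default
```

With rules for TCP ports 80 and 25, two packets of the same Web flow trigger only one ACL lookup; the second is resolved entirely from the flow table.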
Once the flow’s protocol is determined, the state of that flow, as well as any subsequent higher layer processing determined for that flow, can be followed. Much of this subsequent processing (once again, pattern matching) can be accelerated in hardware. When the pattern matching finds an interesting result, then the software can be used to qualify that result, and make the decision on what to do next.
These decisions can relate to anything: providing a specific QoS for that traffic; making sure the higher-priority applications are getting the needed bandwidth; detecting and preventing attacks; or stopping an attack at the first network node by discarding the traffic instead of allowing it to pass. In some cases, pattern matching can search for data within the packets to determine what kind of traffic it is, and then extract the necessary port information to help create new voice connections in the network. It can also identify ports being used for P2P traffic, which is very low priority traffic that consumes a large amount of network bandwidth and assets.
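As an illustration of extracting port information for a voice connection, the Python sketch below pulls the audio port out of a Session Description Protocol (SDP) body, whose media lines have the form `m=<media> <port> <proto> <fmt>`. Using SDP here is an assumption made for the example; the article does not say which signaling protocol is involved, and the function name `extract_audio_ports` is hypothetical.

```python
import re
from typing import List

# SDP media line: "m=audio <port>[/<count>] <proto> <fmt> ..."
MEDIA_LINE = re.compile(rb"^m=audio (\d+)[ /]", re.MULTILINE)


def extract_audio_ports(payload: bytes) -> List[int]:
    """Return the port of every audio media line found in the payload."""
    return [int(match.group(1)) for match in MEDIA_LINE.finditer(payload)]


# Illustrative SDP body (addresses are documentation examples).
SAMPLE_SDP = (
    b"v=0\r\n"
    b"o=alice 2890844526 2890844526 IN IP4 host.example.com\r\n"
    b"s=-\r\n"
    b"c=IN IP4 192.0.2.10\r\n"
    b"t=0 0\r\n"
    b"m=audio 49170 RTP/AVP 0\r\n"
)
```

Software could then use the extracted port (49170 in the sample) to install a flow-table entry for the upcoming media stream before its first packet arrives.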
Implementation Example
As a concrete example, let’s examine the basic operations of an application-aware traffic manager. We’ll take a look at how they can be implemented and accelerated using the MPC8572E processor, which is designed for high-performance, application-aware networking and content security.
The primary purpose of the traffic manager is to classify packet flows based on the application carried, and then apply QoS to the flows. The problem is that some applications, notably P2P, often disguise themselves by:
- allowing users to change the default port number
- using a random port number
- using a port number that belongs to another application, e.g., HTTP’s port 80
Because port numbers alone can’t identify such traffic, the traffic manager must match each packet flow’s payload against Regular Expression (RegEx) patterns for these and other applications, which is very CPU-intensive. It’s been reported that “when all 70 protocol filters are enabled…the system throughput drops to less than 10 Mbits/s. Moreover, over 90% of the CPU time is spent in regular expression matching, leaving little time for other…functions.”
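A small Python sketch shows what payload-based classification looks like in software and why the port number doesn't matter to it. The two signatures are illustrative assumptions written for this example, not the filters referred to in the quote: one matches the start of a BitTorrent handshake (a length byte of 19 followed by the string "BitTorrent protocol"), the other the request line of an HTTP/1.x request. The names `SIGNATURES` and `classify_payload` are hypothetical.

```python
import re

# Illustrative signatures only; a real traffic manager carries many more.
SIGNATURES = [
    ("bittorrent", re.compile(rb"^\x13BitTorrent protocol")),
    ("http", re.compile(rb"^(GET|POST|HEAD) \S+ HTTP/1\.[01]\r\n")),
]


def classify_payload(payload: bytes) -> str:
    """Return the name of the first signature matching the payload.

    Every signature may have to be tried against every flow, which is
    where the CPU time goes when this runs in software.
    """
    for name, pattern in SIGNATURES:
        if pattern.search(payload):
            return name
    return "unknown"


# A P2P handshake sent to port 80 is still recognized as P2P,
# because only the payload is inspected.
handshake = b"\x13BitTorrent protocol" + bytes(48)
web_request = b"GET /index.html HTTP/1.1\r\nHost: www.example.com\r\n\r\n"
```

Here `classify_payload(handshake)` yields `"bittorrent"` and `classify_payload(web_request)` yields `"http"`, regardless of which port carried them.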
In the MPC8572E, this CPU-intensive operation is offloaded to and accelerated by the integrated pattern matcher, greatly improving the system’s performance. In addition, the processor’s special hardware features help accelerate other operations in the application-aware traffic manager.
Today, much of the packet-level processing for security and routing in a host of network devices is being done in software on generic CPUs, or in inflexible ASICs. There are many applications in which Moore’s Law doesn’t keep up with the performance needs as more and more attacks occur in networks.
Multi-core solutions can increase the performance, but it takes lots of software resources to optimize source code to take advantage of the multiple cores. Even then, CPU resources are still being wasted because they’re performing tasks that are better handled with targeted acceleration, such as hardware pattern matching, table lookups, and checksum validation as described in the example above.
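Checksum validation is the easiest of these tasks to show in a few lines. The Python sketch below computes the standard 16-bit one's-complement Internet checksum (as described in RFC 1071) in software; it is meant only to show the kind of regular, per-byte work involved, and says nothing about how any particular SoC implements the offload. The function name and the sample IPv4 header are illustrative.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over big-endian 16-bit words."""
    if len(data) % 2:
        data += b"\x00"          # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:           # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Illustrative 20-byte IPv4 header; bytes 10-11 hold the checksum (0xB861).
header = bytes.fromhex("4500 0073 0000 4000 4011 b861 c0a8 0001 c0a8 00c7")

# Validation: the checksum over a header that includes a correct
# checksum field comes out as zero.
header_ok = internet_checksum(header) == 0
```

Every byte of every header has to pass through this loop, which is why fixed-function hardware handles it more economically than a general-purpose core.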
The hardware acceleration now being added to SoCs will relieve the CPUs of tasks they weren’t designed to perform, and allow them more cycles to add the value for which they were intended. This increased software performance gives system designers more headroom to differentiate their solutions in the market and increases the network security performance of future systems, thereby further securing the high-priority data traveling through the networks of tomorrow.
| Design Issues | Actel | Altera | Lattice | Xilinx |
| --- | --- | --- | --- | --- |
| Top Issues Application Engineers Deal With Daily | JTAG testing or programming through TAP interface; integration and modification of IP cores; SSO mitigation | Power consumption and performance optimization; debugging; interface complexity; signal integrity; system complexity | Meeting hardware and design timing closure; SERDES implementations; pc-board and FPGA noise management; multiple clock domains; power management and sequencing; configuration requirements | Configuration accounts for about 16% of inquiries, and embedded design issues account for about 13% of inquiries; the top issues relating to the ISE design tools include mapping, place-and-route, Project Navigator, and the XST synthesis tool; additional challenges include dealing with memory interfaces, IP integration, and power management |
| I/O signal assignment ordering | Lay out dedicated I/O banks and keep SSO in mind while doing so; switching busses should be spread out across the die and away from PLL supply pins and asynchronous I/Os; then, assign I/Os within each bank; differential pairs should be assigned first, followed by I/Os that require VREF | No special pin assignment sequencing necessary, but do pay attention to each pin's capabilities as some are specialized for specific purposes, such as PCI Express | Proper I/O floor planning is required; start with specialized I/Os, such as DDR2, then assign general I/Os, such as LVTTL; pay special attention to multi-function pins, such as VREF, high-speed clock inputs, and PLL/DLL inputs | Typically, FPGA pins that have the tightest constraints should be locked down first; a typical order for pin assignment might be: 1. input global/regional clocks and FPGA configuration pins; 2. MGT (SERDES), high-speed single-ended interfaces, differential signals, and voltage reference pins as dictated by the I/O standard for a given region; 3. buses that require grouping into adjacent package pins on the FPGA for internal timing or pc-board layout; 4. slow signals like reset; there are several tools and online resources to assist with this process |
| Intrabank incompatible I/O standards, different voltage references, and other bank/region compatibility issues | Make use of Designer software environment, which simplifies the process and provides guidance where necessary | The goal is to make dealing with incompatibilities as easy as possible; to this end, Altera enables pins to support multiple I/O standards across multiple voltages; for example, a pin powered by a 2.5-V supply can accept 3.3-V inputs on most devices; also, most pins are designed to enable hot-socketing, where the FPGA can act as the interface on the board that is being plugged into a live system | Plan ahead to avoid potential downstream pc-board issues and make use of ispLEVER I/O Assistant early in the design | Derive two spreadsheets; first list all design I/Os and their electrical properties and preferred location on the package; then create classes/groups of signals sharing compatible I/O power/reference voltage; second, list all the device I/Os and their properties; sort that table by I/O bank; then filter out all dedicated or previously assigned I/Os from both spreadsheets; match I/Os with the appropriate I/O banks; there are several online resources available |
| Dealing with FPGA to pc-board issues such as SSO, decoupling, routability, escape area, and escape planning with respect to signal layers and thermal issues | Proper decoupling, termination, and layout on the pc board are important; however, it's best to prevent SSO altogether at the source by spreading fast-switching outputs across the die; the pins of QFPs have greater inductance than BGA packages; place sensitive I/Os near VCC or GND pads; create mini groups within a bus and stagger their outputs by more than 1 ns | Check online for literature and seminars that address these topics; published escape routes are available | Distribute the capacitive loads and pin assignment, optimize the drive current, control output slew rates, and reduce the output voltage swing where possible; decoupling examples based on eval boards and documentation are available; for BGA packages, use two signal layers for every two rows of balls; use software to analyze thermal issues | Refer to the following whitepaper on Xilinx.com: "Methodologies for Efficient FPGA Integration into PCBs" |
| Handling of differential signals | Use a standard termination scheme | Make use of on-chip termination for LVDS and differentials and keep signals closely coupled; online documentation is available | Detailed ac- and dc-coupled termination schemes are provided for different differential standards; apply differential pc-board layout rules to keep equal trace lengths between positive and negative traces and between data channels of the transmit and receive side of the differential interface; see online documentation for more details | Refer to the following whitepaper on Xilinx.com: "Transmitting DDR Data Between LVDS and RocketIO CML Devices" |
| Handling of various clocking schemes | Use the Libero IDE 7.3 tool's block-based design methodology for clock distribution to instantiate a global clock placeholder (CLKINT); then the global buffer can be built into the top level of the design | Their tools provide automated methods of choosing which clocks to use; online documentation is available | Use global clocks with low skew to reach any device register; when shorter injection times or smaller skews are required, use local clocks; for the best possible injection, setup, and hold times, use edge (I/O) clocks | Virtex-5 devices contain clock management to address complex timing requirements; refer to Xilinx's online resources for more on these issues |
| Combining IP blocks and IP shopping | Using IP blocks can create natural boundaries that limit what automation tools can optimize, yet may be useful for debugging and can aid with an incremental design flow by limiting changes; for third-party IP, ensure it was designed for the FPGA architecture you are targeting and check whether it is efficient in size and performance; good IP also comes with thorough testbenches and high-quality documentation; you may also want to check IP core heritage along with the supplier before you commit to using it | Use SOPC Builder, Altera's automated system development tool, to tie IP blocks together and build a component; there is more information online | To minimize timing closure issues, check that proper timing analysis, timing simulation, and hardware validation were performed on purchased blocks, or you will need to do this yourself; ensure that each block has adequate timing margin for the target FPGA; use the IPexpress tool to "test drive" IP in hardware; IP cores should be built on some form of a common Control Plane Interface (CPI) | The main challenge in combining IP blocks is ensuring timing and resource requirements are met; when shopping for IP, confirm how the IP vendor has verified and validated the IP with respect to quality and ease of use; minor challenges in combining IP blocks are primarily caused by subtle differences in what is delivered by the IP vendor and the format of those deliverables |
| Handling of timing issues causing the biggest problems | Min delay and hold-time analysis are often overlooked; external hold-time and cross-clock domain paths tend to cause most timing-related issues resulting in hardware failures; perform back-annotated timing simulation, static timing analysis, and simulation for functional verification, and note that static analysis provides the best timing coverage; the SmartTime timing analyzer allows for entering of external input and output delays and then performs the calculations | For on-chip timing, use caution with the FMax parameter; the TimeQuest timing analyzer and physical synthesis help; for off-chip timing, the source-synchronous clocking used for many interfaces with LVDS can be problematic; TimeQuest can also be used for source-synchronous interfaces | Timing windows are getting smaller with higher frequencies; clock domain transfer at high speeds, race conditions, and hold time violations are the most problematic; a combination of careful timing analysis along with using the ispLEVER tool can help identify the areas and help resolve the issues; due to the extremely high performance of the Lattice FPGA fabric, the probability of hold time violations has significantly increased | No answer provided |
| Modifications to tool suite to help overcome the above issues | Today, designers can implement a fairly complex system-on-a-chip on modern FPGAs, and they require features from traditional logic design tools, as well as those from analog, DSP, and processor design tools; to support cross-functional engineering disciplines, their tools integrate the look and feel of tools from each discipline | The quality of the tools is a key focus area with the goal of providing easy-to-use tools that deliver a high quality of results; this includes emphasizing the elimination of internal errors and crashes and reducing compile time and PC memory usage; PowerPlay power analysis and optimization technology enables designers to accurately analyze and optimize both dynamic and static power consumption | Power, SSO, and FPGA constraint planning, runtime/turnaround time improvements, ispTRACY debugging tool, pre-built IP, RTL Analysis and improvements, and documentation (HDL Explorer) | No answer provided |