Steering Your Way Through Net Processor Architectures

There is a flurry of 10-Gbps network processors on the market today. Here's a guide to steer you through these architectures, so you can pick the best one for your networking system design.

By Scott Matheson, Silicon Access Networks
CommsDesign
Jul 24, 2002
 
Choosing a network processor is like choosing a car. To be satisfied, you have to find the feature set that meets your particular needs or, more specifically, the needs of your system architecture.

The challenge for designers, however, is knowing which features to look for. There is a flurry of network processor options on the market today, and more are coming as the networking market moves to the 10-Gbps range. Each of these processors takes a different approach to packet processing tasks, forcing designers to make some tough decisions during the selection process.

When analyzing net processor architectures, the engineer's task is simplified by taking an "inside-out" approach. In other words, start by looking at the underlying heart of the architecture—the instruction set—then move out methodically towards the pins. Each step in the process will uncover strengths and weaknesses. With that in mind, this article will look at:

  1. Instruction set architecture (ISA)
  2. Net processor core implementation
  3. On-chip accelerators and other integration
  4. I/O

Let's start the discussion by looking at the ISA.

ISA—The First Delineator
The first differentiator between architectural approaches is the instruction set architecture (ISA). In general, ISAs fall into two camps: those based on a standard RISC architecture (MIPS, ARC, etc.) and those based on a custom networking instruction set.

RISC-based NPUs all share a common philosophy: throw enough MIPS (millions of instructions per second) at the packet-processing problem and the problem is solved. Unfortunately, this is a highly inefficient solution for networking equipment designs.

RISC processors evolved from the theory that only the most frequently used assembly instructions should be included in a processor's ISA. The fewer instructions the CPU designer needed to implement, the simpler the CPU design could be. This resulted in faster and cheaper processors.

The question of which instructions to implement was answered by exhaustive studies of the assembly code that high-level language compilers actually generated, combined with research into which low-level instructions were the fastest and easiest to implement. The problem from a networking standpoint is that RISC processors were designed based on studies of general-purpose programs (GUIs, spreadsheets, etc.), not networking data path applications. This distinction is extremely important.

Let's look at the DSP market for a simple analogy. General-purpose CPUs can certainly execute DSP algorithms. Many of today's PC modems, for example, rely upon the host CPU to perform some of the signal processing required. But you would never use an array of Pentium cores to solve a more complex DSP problem. DSPs evolved into an extremely specialized class of processor because of the distinct signal processing requirements of mobile phones and other DSP-enabled systems.

Similarly, it is highly inefficient to turn loose general-purpose RISC CPUs on the packet-processing problem. Just as DSPs need specialized instructions for signal processing algorithms (multiply/accumulate, circular addressing, etc.), network processors have processing needs that significantly deviate from general-purpose RISC engines.

RISC CPUs come with another bit of baggage in the way of programming language. C/C++ is far and away the most popular programming language for general-purpose CPUs. Many RISC-based NPU vendors tout the use of C and C++ as a big advantage for networking equipment developers. But is it really?

While it is true that most programmers are comfortable with C, C is extremely clumsy with regard to typical data path operations. For example, the extraction of a bit field from the middle of a data word (e.g., a TCP port number) can take several lines of C for bit shifting, ANDing and ORing operations. These lines of C will in turn generate dozens of assembly instructions. Thus, C may not be the most efficient option for designers.
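
The shift-and-mask pattern described above can be sketched in a few lines. This is only an illustration, written in Python rather than C for brevity; the helper names `extract_bits` and `insert_bits` and the sample word are assumptions, not part of any vendor's toolchain. Each helper needs a shift, a mask computation, and an AND or OR, which is the work a single bit-oriented insert/extract instruction would replace.

```python
def extract_bits(word: int, lsb: int, width: int) -> int:
    """Return `width` bits of `word`, starting at bit `lsb` (bit 0 = least significant)."""
    mask = (1 << width) - 1
    return (word >> lsb) & mask


def insert_bits(word: int, lsb: int, width: int, value: int) -> int:
    """Return `word` with the `width`-bit field at `lsb` replaced by `value`."""
    mask = ((1 << width) - 1) << lsb
    return (word & ~mask) | ((value << lsb) & mask)


# First 32-bit word of a TCP header: source port in the upper 16 bits,
# destination port in the lower 16 bits (sample value chosen for illustration).
tcp_word0 = 0x04D20050

src_port = extract_bits(tcp_word0, 16, 16)         # 1234
dst_port = extract_bits(tcp_word0, 0, 16)          # 80
rewritten = insert_bits(tcp_word0, 0, 16, 8080)    # destination port changed to 8080
```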

A better solution here is to analyze what the typical data path code is doing, and develop an ISA that is optimized to handle these data path functions. For example, Table 1 shows some of the optimizations that can be made, compared with what RISC processors typically offer.

Table 1: Comparison of RISC and Custom ISA Architectures

| Function | RISC ISA Requires... | NP-specific ISA Optimization |
|---|---|---|
| Packet modification | Multiple shifts, ANDs, ORs | Single cycle bit oriented insert/extract |
| Packet parsing/search key creation | Multiple shifts, ANDs, ORs | Single cycle complex bit field oriented compare and branch |
| Code branching based on packet contents | Multiple shifts, ANDs, ORs coupled with sequential compare-and-branch operations | Single cycle complex bit field oriented compare and branch |
| Scanning forward/backward in a packet | Complex pointer arithmetic | Packet-relative register addressing |

The performance gains from implementing a network-specific ISA can be staggering. For example, an optimized instruction set has been developed that can parse the entire Ethernet (or PPP, IPv6, etc.) header of a packet and jump to the relevant processing code in just 2 clock cycles. In a RISC-based engine, this same function could require dozens of cycles.
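
To make the comparison concrete, here is a rough sketch of the header parse and dispatch as a general-purpose processor would have to express it: load the type field, compare it, possibly skip a VLAN tag, then branch. The function name `classify_ethernet` and the handler labels are hypothetical; the EtherType values are the standard assignments for IPv4, IPv6, ARP and 802.1Q.

```python
import struct

# Standard EtherType values mapped to illustrative handler names.
ETHERTYPE_HANDLERS = {
    0x0800: "ipv4",
    0x86DD: "ipv6",
    0x0806: "arp",
}
VLAN_TPID = 0x8100  # 802.1Q tag: the real EtherType sits 4 bytes further on


def classify_ethernet(frame: bytes) -> str:
    """Return the name of the handler that should process this frame."""
    if len(frame) < 14:
        return "runt"
    (ethertype,) = struct.unpack_from("!H", frame, 12)   # bytes 12-13
    if ethertype == VLAN_TPID:
        if len(frame) < 18:
            return "runt"
        (ethertype,) = struct.unpack_from("!H", frame, 16)  # skip the 4-byte tag
    return ETHERTYPE_HANDLERS.get(ethertype, "unknown")
```

Every step here (length check, field load, compare, conditional skip, table lookup) is a separate instruction sequence on a RISC core, which is the overhead a compare-and-branch instruction aimed at bit fields is meant to remove.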

Programming language is one potential drawback to network processors using a custom ISA. First-generation network processors based on non-RISC ISAs were hampered by difficult-to-use assembly languages. Network processor vendors are addressing this problem by developing a C-like language that preserves the look and feel of C, but adds the elements required for packet processing applications.

Multiprocessing Details
The next area that designers should investigate is the implementation of the instruction set in hardware. Multiprocessing and multithreading are two implementation techniques that have become nearly ubiquitous in 10-Gbps class network processors. While the use of these architectural enhancements seems to put all products on an even footing, there are in fact critical issues to be considered.

Multiprocessing simply refers to a network processor IC housing multiple processor cores. For example, some 10-Gb processors will house up to 32 cores on a chip.

A critical decision for designers to consider is how the cores are deployed to work on packets. There are two common models here (see Figure 1):

  1. Run-to-Completion: In this model, a packet is assigned to a single net processor core and all processing on that packet is done by that core.
  2. Systolic: In this model, a given packet is handed from core to core until processing is complete. For example, core 1 might do Layer 2 processing, then hand the packet to core 2 for Layer 3 processing and so on. In some architectures, there are even different ISAs at different stages of the processing pipeline.

Figure 1: Net processor chip designers either implement run-to-completion or systolic architectures in their chip designs.

Systolic processing can be quite flexible for handling certain processing tasks. However, it requires efficient load balancing between stages, and extreme care must be taken to make sure no processing stage stalls. Achieving this balance is complex and often requires code generation tools to use the processing pipeline effectively. And once balancing is completed, how does one guarantee in the field that an unanticipated packet, or a future code upgrade, won't cause instability or severe performance degradation?
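
The sensitivity to load balancing can be seen in a deliberately idealized model, sketched below. It assumes one core per pipeline stage, no memory stalls and no hand-off cost, and the stage times are made-up numbers; the function names are illustrative only. Under those assumptions a systolic pipeline runs at the pace of its slowest stage, while run-to-completion depends only on the total work per packet.

```python
def systolic_rate(stage_ns):
    """Packets per second for a pipeline with one core per stage.

    Throughput is limited by the slowest stage.
    """
    return 1e9 / max(stage_ns)


def run_to_completion_rate(stage_ns, cores):
    """Packets per second when each core performs every stage on its own packet."""
    return cores * 1e9 / sum(stage_ns)


balanced = [40, 40, 40, 40]   # ns per stage, evenly split (hypothetical)
skewed = [20, 40, 80, 20]     # same total work, poorly balanced (hypothetical)

balanced_systolic = systolic_rate(balanced)          # 25 Mpps
skewed_systolic = systolic_rate(skewed)              # 12.5 Mpps
skewed_rtc = run_to_completion_rate(skewed, 4)       # 25 Mpps
```

With the same four cores and the same total work, the skewed pipeline loses half its throughput in this model, which is why a code upgrade that lengthens one stage can force a rebalance.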

Another problem with systolic architectures is the flexibility of the pipeline. Tunneled packets, for example, might have an IPv4 packet encapsulated in IPv6. In some systolic architectures, it would be difficult to pass the "newly discovered" IPv4 packet back to the appropriate network processor core for IPv4 processing. Most chips employing a systolic architecture can accomplish this task, but it will cause a "pipeline bubble," since an unexpected packet is re-injected into the systolic flow. Pipeline bubbles occur when unexpected work is re-inserted into the execution flow, like an air bubble in a fuel hose. When this happens, the processing engine's normal execution flow is interrupted and processing stalls.

By comparison, run-to-completion offers a much simpler programming model: there is no need to pass packets to other cores, balance load among the cores, or worry about interactions between packets. To the programmer, it looks like a traditional single-processor application. When coupled with multithreading, run-to-completion offers a near-perfect solution for processing packets.

Understanding Multithreading
Multithreading is a technique that benefits both run-to-completion and systolic architectures. In multithreading, a given network processor core has multiple register sets with each register set assigned to a specific thread of execution (typically a packet). Whenever a long latency operation occurs (an off-chip address lookup, for example), the net processor core swaps register sets and begins executing another thread (working on another packet) that is ready to run.

Here again, though, engineers need to be careful. Multithreading can be a programmer's nightmare if the threading is programmer-visible, as it is in many NPU architectures. Such architectures force the programmer to handle context swapping, register pushing and popping, etc. Designers need to carefully analyze the amount of handling code required as well as the performance hit taken for this handling.

A better solution is to have hardware logic inside the network processor handle the multithreading. For example, under the hardware approach, the instruction fetch engine can become intelligent enough to spot upcoming thread change conditions and automatically fill the instruction pipeline with another thread's code. The result is a zero-cycle context switch. In addition, the context switch is invisible to the programmer.
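
A toy simulation can show why both the number of contexts and the cost of a context switch matter. The sketch below is an assumption-laden model, not a description of any real NPU: every thread repeatedly computes for a fixed number of cycles and then waits on a fixed-latency lookup, and `core_utilization` and its parameters are invented for this illustration.

```python
def core_utilization(contexts, compute_cycles, lookup_cycles,
                     switch_cycles=0, total_cycles=4000):
    """Fraction of cycles one core spends executing packet code.

    Each context computes for `compute_cycles`, then issues a lookup that
    returns `lookup_cycles` later. `switch_cycles` models the overhead of a
    software-visible context switch (0 = hardware-managed switch).
    """
    ready_at = [0] * contexts   # cycle at which each context may run again
    busy = 0
    cycle = 0
    while cycle < total_cycles:
        runnable = [i for i, t in enumerate(ready_at) if t <= cycle]
        if not runnable:
            cycle = min(ready_at)        # core idles until a lookup returns
            continue
        ctx = runnable[0]
        cycle += switch_cycles           # cost of swapping register sets
        cycle += compute_cycles          # run until the thread issues its lookup
        busy += compute_cycles
        ready_at[ctx] = cycle + lookup_cycles
    return busy / cycle


one_context = core_utilization(1, compute_cycles=10, lookup_cycles=30)
four_contexts = core_utilization(4, compute_cycles=10, lookup_cycles=30)
four_with_overhead = core_utilization(4, 10, 30, switch_cycles=2)
```

In this model a single context leaves the core idle three quarters of the time, four contexts with a free switch keep it fully busy, and a two-cycle switch cost takes back roughly a sixth of that.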

When dealing with multiprocessing and multithreading, designers often struggle with striking the right balance between cores and the number of contexts when choosing a net processor architecture. In other words, do you need 16 cores, with 4 contexts each, or 32 cores with 8 contexts each?

Clearly, it's not simply a matter of more is better. The balance must be chosen based on intensive code analysis and extensive system-level simulation. For example, latency to coprocessors (or CAMs) is usually "buried" in the number of threads available: the more threads available, the more latency can be absorbed. Conversely, too many cores and too many threads translate directly into too much cost for the network processor. So the balancing act is critical. Be sure to ask your net processor vendors how they chose their implementation.
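
As a first-order starting point before real simulation, the balance can be estimated from three numbers: the packet arrival interval, the compute time per packet, and the time a packet spends stalled on lookups. The sketch below uses that simple queueing argument; the function names and the 500 ns / 1,500 ns workload are hypothetical values chosen only to exercise the arithmetic.

```python
import math


def size_for_line_rate(arrival_ns, compute_ns, stall_ns):
    """Return (minimum cores, minimum total thread contexts) to keep up.

    Cores must cover the compute time; contexts must cover every packet
    that is in flight, whether computing or waiting on a lookup.
    """
    min_cores = math.ceil(compute_ns / arrival_ns)
    min_contexts = math.ceil((compute_ns + stall_ns) / arrival_ns)
    return min_cores, min_contexts


def fits(cores, contexts_per_core, min_cores, min_contexts):
    """True if a cores x contexts configuration meets both minimums."""
    return cores >= min_cores and cores * contexts_per_core >= min_contexts


# Hypothetical workload: 40 ns arrival interval, 500 ns compute, 1500 ns stalled.
min_cores, min_contexts = size_for_line_rate(40, 500, 1500)   # (13, 50)

small_ok = fits(16, 4, min_cores, min_contexts)   # 64 contexts: enough
large_ok = fits(32, 8, min_cores, min_contexts)   # 256 contexts: enough, but costly
thin_ok = fits(16, 2, min_cores, min_contexts)    # 32 contexts: too few
```

For this made-up workload the 16-core, 4-context configuration already suffices, so the larger device would add cost without adding throughput. A different code profile would shift the answer, which is why the analysis has to use your own code.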

Choosing the Right Level of Integration
Integration is another important area for designers to evaluate when selecting a 10-Gbps net processor. Here, engineers may be tempted to look at integration simply from the cost standpoint, but performance and scalability can be severely impacted by failure to integrate key blocks.

One of the biggest performance limiters in network processing is memory bandwidth. At 10-Gbps line rates with minimum-size packets, you have approximately 40 ns between incoming packets. Thanks to multithreading and multiprocessing, we can spend longer than 40 ns on a packet, but each additional 40-ns interval ties up another thread.

The point is, things must get done fast. If you have to go off-chip to a 32-bit (or even 64-bit) RAM, processing will suffer. A better option is to choose a network processor that integrates critical portions of the RAM. On-chip memory can run at core speed and can be reached over a much wider bus.
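
The time budget follows directly from packet size and line rate. In the short calculation below, the function names are illustrative, and the 50-byte figure is an assumption: it is simply the per-packet wire footprint that reproduces the article's 40 ns and 25 Mpps numbers, while a bare 40-byte packet would take 32 ns at 10 Gbps.

```python
def interarrival_ns(wire_bytes, line_rate_gbps):
    """Time on the wire for one packet, in ns (1 Gbps = 1 bit per ns)."""
    return wire_bytes * 8 / line_rate_gbps


def packets_per_second(arrival_ns):
    """Packet rate implied by a given arrival interval."""
    return 1e9 / arrival_ns


bare_40_byte = interarrival_ns(40, 10)      # 32 ns, no framing overhead
with_overhead = interarrival_ns(50, 10)     # 40 ns, assumed 50 bytes per packet slot
rate = packets_per_second(with_overhead)    # 25 million packets per second
```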

The same problem occurs with packet buffering. At 10 Gbps, dedicated high-speed memories must be used for buffering. For example, at least one processor vendor makes use of 3 Rambus channels to provide buffering at 10 Gbps. Scaling such a design to 20 Gbps, or later 40 Gbps, can only be achieved by going wider since memory speeds don't scale at the same rate as network line speeds. A 40-Gbps chip using Rambus would require 12 channels using today's speeds.
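
The channel count in this example is plain proportional scaling from the 3-channel, 10-Gbps design point. A minimal sketch, assuming per-channel speed stays fixed (the function name is illustrative):

```python
import math


def channels_needed(line_rate_gbps, base_channels=3, base_rate_gbps=10):
    """Buffer memory channels needed if per-channel speed does not improve."""
    return math.ceil(base_channels * line_rate_gbps / base_rate_gbps)


at_20_gbps = channels_needed(20)   # 6 channels
at_40_gbps = channels_needed(40)   # 12 channels
```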

A better solution is to integrate the packet buffer. Doing this solves several problems. First, power is reduced since we don't need multiple Rambus devices or QDR SRAM. Second, scaling is relatively simple since we can make the buses wider and the arrays deeper in next generation devices. And third, on-chip packet buffering reduces latency.

Choosing the number of network processor chips to implement is another big integration challenge for design engineers building 10-Gbps equipment. Some vendors deliver single-chip 10-Gbps offerings while others provide multi-chip implementations.

Single-chip solutions have only one advantage: for a very specific application, say a 10 Gigabit Ethernet Layer 2/3 switch, an optimized single-chip solution could potentially be built. If you want the maximum architectural flexibility and the ability to scale in terms of performance while offering your customer the ability to differentiate, then single-chip is the wrong way to go.

Getting the Right I/Os
I/O capability also has a big impact on the selection of a network processor architecture. There are typically four types of I/O on network processors:

  • Packet I/O: For most net processors today this is the industry standard SPI 4.2 interface.
  • Coprocessor I/O: Most net processors make use of a semi-proprietary interface for connecting to look-aside coprocessors and CAMs, though some implement the newly adopted Network Processing Forum Lookaside-1 standard (LA-1).
  • Memory I/O: Most net processors allow connection to either SRAM or DRAM.
  • Control Plane I/O: This is the connection between the forwarding plane and control plane.

In terms of packet I/O, the most critical parameter is the number of SPI 4.2 interfaces supported. Each SPI interface can transfer up to 25 million packets per second (Mpps) of 40-byte IP packets through the network processor. Some vendors with a single SPI interface (one receive, one transmit) claim to process 50 Mpps. While this may be the processing power of the NPU cores, the number is irrelevant if you can't physically get 50 Mpps into the chip.
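
Put as arithmetic, the deliverable packet rate is the smaller of what the cores can process and what the interfaces can carry. The sketch below uses the article's 25 Mpps per-interface figure; the function and constant names are illustrative.

```python
SPI42_MPPS_PER_INTERFACE = 25   # article's figure for 40-byte IP packets


def deliverable_mpps(core_mpps, spi_interfaces):
    """Packet rate the chip can actually sustain, in Mpps."""
    return min(core_mpps, spi_interfaces * SPI42_MPPS_PER_INTERFACE)


single_interface = deliverable_mpps(core_mpps=50, spi_interfaces=1)   # 25
dual_interface = deliverable_mpps(core_mpps=50, spi_interfaces=2)     # 50
```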

Coprocessor bandwidth and efficiency must also be carefully considered. Latency through coprocessors translates into potential performance bottlenecks. Raw bandwidth is the simplest measure here, but not the only one.

For example, the LA-1 standard from the NPF describes a QDR SRAM-like interface that has a bandwidth of about 6 Gbps per direction, plenty of bandwidth for lookups at 10-Gbps packet rates, theoretically. But the standard stops short of describing how to interleave lookups from multiple threads and how to return data when the lookup is done without processor intervention. So comparing just on bandwidth is not sufficient.
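
A quick budget shows what "plenty, theoretically" means here. Dividing the roughly 6 Gbps per direction by the 25 Mpps packet rate gives the number of bits available per packet. The 104-bit key below is an assumed example (an IPv4 5-tuple: two 32-bit addresses, two 16-bit ports, an 8-bit protocol), and the calculation ignores any command or protocol overhead on the interface.

```python
def bits_per_packet(bandwidth_gbps, packet_rate_mpps):
    """Interface bits available per packet, per direction."""
    return bandwidth_gbps * 1000 / packet_rate_mpps


budget = bits_per_packet(6, 25)           # 240 bits per packet

FIVE_TUPLE_KEY_BITS = 32 + 32 + 16 + 16 + 8    # 104 bits, assumed key format
full_keys_per_packet = int(budget // FIVE_TUPLE_KEY_BITS)   # 2
```

Two full-width key lookups per packet fit in this idealized budget, so an application needing several lookups per packet would want a closer look at how the interface is actually used.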

Memory bandwidth is perhaps the easiest metric to compare among processors, but there are still some hidden "gotchas." Some processors make use of external memory independent of the code that's actually running. For example, a network processor might need some external SRAM for state storage and may force your tables to share this same space. This reduces the effective bandwidth available for processing.

Use Your Own Code!
In the end, there is really only one way to effectively compare network processors: benchmark them with your own code. All vendors offer emulation environments for their products. Get hold of one for each network processor you're evaluating and run some basic performance tests using your own code (the vendor's nicely tuned code may mask problems they don't want you to see). Basic testing should eliminate many potential vendors for your application. Then step it up a notch by checking corner cases. And if your chosen processor is systolic, try doing a code upgrade and see how painful pipeline rebalancing is. Finally, find out how accurate the simulation is by comparing results against actual silicon. A few weeks spent on this kind of analysis can save you years of pain from choosing the wrong architecture.

About the Author
Scott Matheson is an applications engineering manager at Silicon Access Networks. Scott holds a BS in Electrical Engineering from Rensselaer Polytechnic Institute in Troy, NY. He can be reached at scott.matheson@siliconaccess.com.
