Taking a Graphical View of NPU Software Design

In OC-48 designs and beyond, a model that only connects existing software components is not sufficient for delivering line-rate performance. What's needed is a graphical view of software component interactions and packet flows.

By Ken Hines and Ross Ortega, Consystant
CommsDesign
Jan 03, 2003
Developing high performance software applications that run on a network processor is a challenging endeavor. With bandwidth requirements continually increasing from OC-48 to OC-192 and beyond, network processors must be built to process several packets simultaneously by using a highly parallel architecture. Network processing applications must exploit this parallelism to achieve aggressive throughput requirements. Debugging and tuning the software components, once deployed to the network processor, requires a great deal of analysis and may necessitate a different allocation of the processing elements inside the network processor.

Complicating the development process is the increasingly intricate nature of network processing applications. While network processors are designed specifically for fast packet processing, these powerful parallel processors often feature additional hardware support for forwarding plane functionality. Designed for extremely high throughput, often with data rates beyond processor clock rates, forwarding plane software applies the same functionality independently to large numbers of packets in parallel.

For example, processors in Intel's IXP line of network processors feature eight to 16 packet-processing elements (PPEs), also called microengines, each of which contains eight hardware threads, special queues for receiving and transmitting packets, hardware state machines that specifically read from and write to the transmit and receive queues, and other hardware elements. To fully harness the power of these forwarding plane processors, programmers must understand not only the processors' instruction sets but also how to rapidly synchronize concurrent threads, make full use of the on-chip hardware resources, and handle the corner cases of the specific application being built.

Furthermore, to build high performance solutions, network processor software developers must write their applications in microcode. Typically that means that developers must program one of the most sophisticated hardware architectures (NPUs) while using the most primitive software language.

The use of a software component model accelerates the development process by providing a framework that closely matches the pipeline nature of typical applications. But in order to debug and tune the software components once they are deployed to the network processor, developers must perform extensive analysis and modify allocation of the processing elements inside the network processor. The key to achieving this goal and reducing the effort needed to optimize performance is a tool that offers a comprehensive graphical view of software component interactions and packet flows.

Developing Forwarding Plane Apps
Forwarding plane applications involve repetitive operations on a series of packets. A typical application has ingress and egress phases, and packet processing can take place in either. In the ingress phase, a packet is read from a network, processed, and written out to a switch fabric. In the egress phase, a packet is read from the switch fabric, processed, and sent out to the network. Although this architecture typically only manipulates packet headers, it also offers the opportunity for "deep packet processing", where the contents of a packet are analyzed and manipulated as well.

A software team may write monolithic microcode to implement a software application for the forwarding plane. In the monolithic model, there are no clean separations between the various phases.

To simplify the overall software design process in the forwarding plane, many developers are adopting a software component model where reasonable divisions in functionality are established so software functionality can be distributed among several processors. This approach also facilitates reuse of common functionality across a number of designs.

A software component model should explicitly define the interaction behavior between the phases and components. In such an approach, many software components may be grouped together in a given phase. Modularity and reuse are the primary advantages of a component model: a well-defined software architecture can be built from modular components. Team members can work independently on the various components and then compose them to build the final application. Alternatively, existing components can be composed in different ways to create a new application.
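
As a rough illustration of the composition idea, the sketch below chains three stand-in components into an ingress pipeline. Everything in it is hypothetical: the function names, the packet fields, and the toy exact-match table are assumptions made for the example, and none of it is Intel's microblock API. Real components would be written in microcode.

```python
from typing import Callable, Dict, Optional

Packet = Dict[str, object]
Component = Callable[[Packet], Optional[Packet]]


def receive(packet: Packet) -> Optional[Packet]:
    """Stand-in receive component: drop packets that carry no destination."""
    return packet if "dst" in packet else None


def make_forwarder(table: Dict[str, int]) -> Component:
    """Build a forwarding component around a toy exact-match table."""
    def forward(packet: Packet) -> Optional[Packet]:
        port = table.get(str(packet["dst"]))
        if port is None:
            return None  # no route known: drop the packet
        return {**packet, "next_hop_port": port}
    return forward


def transmit(packet: Packet) -> Optional[Packet]:
    """Stand-in transmit component: mark the packet as sent."""
    return {**packet, "sent": True}


def compose(*components: Component) -> Component:
    """Chain components in order; a None result means the packet was dropped."""
    def pipeline(packet: Packet) -> Optional[Packet]:
        current: Optional[Packet] = packet
        for component in components:
            if current is None:
                return None
            current = component(current)
        return current
    return pipeline


# The same components could be recombined in a different order or with
# different tables to build another application.
ingress = compose(receive, make_forwarder({"10.0.0.1": 3}), transmit)
routed = ingress({"dst": "10.0.0.1"})
dropped = ingress({"dst": "10.9.9.9"})
```

Here `routed` carries the looked-up port and the sent flag, while `dropped` is `None` because the toy table has no entry for that destination.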

The Need for Software Synthesis
In OC-48 designs and beyond, a model that only connects existing software components will not achieve the desired performance. The unique hardware resources available in a network processor must be fully exploited to achieve line rate. Thus, software synthesis technology is needed to generate optimized microcode when connecting software components that execute on the PPEs.

One such software synthesis approach is called coordination-centric design. This technology has been applied to Intel's microblock software component model, and Figure 1 shows a simple IPv4 forwarder application built with it. The connections between the components are called coordinators. A coordinator captures the interaction between components but, at this stage, does not dictate a particular implementation; that choice is deferred until the mapping stage. For example, if all of the software components are assigned to the same PPE, the coordinators will be implemented using local registers. However, if the components are assigned to different PPEs, a different communication mechanism will be needed.


Figure 1: Microblock diagram of a simple IP forwarder application. In this figure, the MB_POS_Receive component reads IP packets from a POS interface. The MB_IPV4_Fwder uses forwarding tables to determine the next hop for each packet, and the MB_CSIX_Transmit component transfers packets to the switch fabric connected to the network processor.

A major advantage of using coordination-centric design is that the software architecture is decoupled from the hardware architecture without sacrificing performance. The software application is initially created from software components without regard to which specific PPE, hardware thread, or communication mechanisms will be needed. The software team can focus on the functionality and correctness of the application and defer many performance issues until later in the design process.

Once a software application has been composed, it must be mapped to the hardware resources of the network processor. The power of coordinators can be clearly seen during the mapping phase.

When mapping a software application to an Intel IXP2400 network processor, for example, the components must be assigned to groups, and the groups must then be assigned to microengines and hardware threads. When a packet is passed between components in the same group, the coordinator between them is implemented using local registers. In the IXP2400, PPEs that are next to each other have a fast communication link called a "next neighbor" link.

Scratch rings can also be used for communication between different PPEs. Scratch rings are not as fast as next neighbor links, but do not require the communicating microengines to be next to each other. When the groups are mapped to the microengines, the software developer can select the most appropriate communication channel for the implementation of the coordinator. Optimized line-rate microcode will be generated that implements the developer's decision.
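
The mapping-time choice described above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the synthesis tool's actual logic: each component is assumed to sit on a single microengine identified by an integer index, "next to each other" is assumed to mean that the consumer's index is exactly one higher than the producer's, and the function name `suggest_channel` and the microengine numbers are hypothetical. In the tool described here the developer makes the final selection, so the function only proposes a default.

```python
from enum import Enum


class Channel(Enum):
    LOCAL_REGISTERS = "local registers"
    NEXT_NEIGHBOR = "next neighbor"
    SCRATCH_RING = "scratch ring"


def suggest_channel(producer_me: int, consumer_me: int) -> Channel:
    """Propose a coordinator implementation from the microengine assignment."""
    if producer_me == consumer_me:
        # Same microengine: the hand-off can stay in local registers.
        return Channel.LOCAL_REGISTERS
    if consumer_me == producer_me + 1:
        # Assumed adjacency rule: the consumer is the producer's next neighbor.
        return Channel.NEXT_NEIGHBOR
    # Any other pair: slower, but works between arbitrary microengines.
    return Channel.SCRATCH_RING


# Hypothetical assignment of the Figure 1 components to microengine indices.
mapping = {"MB_POS_Receive": 0, "MB_IPV4_Fwder": 1, "MB_CSIX_Transmit": 5}
coordinators = [
    ("MB_POS_Receive", "MB_IPV4_Fwder"),
    ("MB_IPV4_Fwder", "MB_CSIX_Transmit"),
]
plan = {
    (src, dst): suggest_channel(mapping[src], mapping[dst])
    for src, dst in coordinators
}
```

Changing only the `mapping` dictionary changes the resulting `plan`, which mirrors the point that the component descriptions themselves stay untouched when the mapping changes.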

Debugging and Performance Tuning
Many NPU vendors provide simulators so that software developers can begin the debugging process prior to running on the hardware. A simulator provides a stable environment to get detailed information during the execution of the software. Given the number of microengines and hardware threads, the software developer can quickly become overloaded with information.

A major limitation of the current simulation and debugging environments is that while they provide detailed information, this information tends to be in the form of individual register values, memory dumps, and resource utilization. Accordingly, the software developer must reconstruct system level behavior from this disparate data.

Based on the problems with current offerings, there is a strong need for design level visualization tools for system executions. These tools must show:

  • Design level components
  • Component level interactions including packet flow
  • System-level performance and component-level performance

Without such tools, designers can spend days identifying problems that are visible in minutes using high-level visualization.

Figure 2 shows a commercial visualization tool running an application with six components mapped to an IXP2400. In this figure, system-level information available includes:

  • A trace for each component
  • Subdivision of traces to show how components are deployed over threads
  • Significant events (e.g., packet received, packet transmitted, state changed, etc.)
  • Evolution of control state
  • Interactions between components (e.g., data transferred)
  • Complete packet flows through all components
  • Execution durations
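
To make the idea of reconstructing system-level behavior from low-level data concrete, the sketch below groups a flat event log into per-packet flows. The record layout (`TraceEvent` and its fields) and the sample events are assumptions invented for this example; they do not describe the trace format of any particular simulator or of the tool shown in Figure 2.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class TraceEvent:
    cycle: int        # simulation time at which the event occurred
    component: str    # design-level component that reported it
    thread: int       # hardware thread the component instance ran on
    kind: str         # e.g. "packet received", "packet transmitted"
    packet_id: int    # identifier used to follow one packet end to end


def packet_flows(events: List[TraceEvent]) -> Dict[int, List[TraceEvent]]:
    """Group events by packet and order each group by time."""
    flows: Dict[int, List[TraceEvent]] = defaultdict(list)
    for event in sorted(events, key=lambda e: e.cycle):
        flows[event.packet_id].append(event)
    return dict(flows)


# A made-up, unordered log such as a simulator might emit.
events = [
    TraceEvent(40, "forwarder", 0, "packet received", 1),
    TraceEvent(0, "receiver", 0, "packet received", 1),
    TraceEvent(10, "receiver", 0, "packet received", 2),
    TraceEvent(90, "transmitter", 0, "packet transmitted", 1),
    TraceEvent(95, "forwarder", 0, "packet received", 2),
]

flows = packet_flows(events)
path_of_packet_1 = [e.component for e in flows[1]]
latency_of_packet_1 = flows[1][-1].cycle - flows[1][0].cycle
```

With the flows grouped this way, the path a packet took and the time it spent in the system fall out directly, which is the kind of view a design-level tool would draw as arrows between component traces.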

Figure 2 also illustrates one of the ways in which a bug could be isolated with these capabilities. This view shows a system analyzer trace where an incorrect data value is being passed from the forwarder to the receiver.


Figure 2: A visualization analyzer trace showing the data of a selected packet in detail.

Software Sensitivity
The performance of network processor software applications is extremely sensitive to how the software is mapped to the hardware. The majority of a designer's time is spent tuning the application to achieve line rate performance. Tuning involves assigning the software components to different pipeline stages, mapping data structures to various memories, and trying to exploit different communication mechanisms.

The power of coordination-centric design is clearly leveraged during this iterative performance tuning cycle. Instead of being forced to tediously modify numerous sections of microcode for each different mapping, developers can use the GUI shown in Figure 4 below to assign the components and coordinators to the various hardware resources. Strong compiler technology leverages information about a particular assignment to generate high performance microcode.

Trying a different mapping requires only a different assignment. New microcode is automatically generated that exploits the optimization opportunities made available by that particular mapping. The software developer can immediately see the performance implications of mapping decisions in the visualization tool. For example, a bottleneck can be eliminated by assigning a component to more hardware threads or more microengines, making fuller use of the parallelism available in the network processor.

Figure 3a shows an execution of the POS_RX->IPV4 Forward->C6_TX system in which each component is mapped to a single thread in a single microengine. It is immediately apparent from this display that the forwarding component is the performance bottleneck for the entire system. The packet arrows are increasingly slanting toward the right, indicating longer and longer periods between when the RX component receives a packet, and when the forwarding component is ready to process it. Also, the execution time for the forwarding component shows that it is continuously busy.


Figure 3: A performance bottleneck eliminated by adding threads to IP forwarding.

Depending on how much of this time is spent accessing memory, how memory accesses line up, and how large the critical sections of this component are, the forwarding component bottleneck could be solved in one of two ways: either by breaking the forwarding component down so that it can be mapped across a number of microengines, or by simply increasing the number of threads on which it executes in the same microengine (each thread is assumed to be executing the same functionality on separate packets).

As shown in Figure 3b, deploying the IP forwarder over eight threads effectively removes it as the bottleneck. The visualizations helped both in identifying the performance bottleneck and in evaluating the quality of an attempt at eliminating it. Counterintuitively, in many cases increasing the number of threads can decrease performance because of the increased amount of synchronization required. Such an effect would be readily visible in the graphical display.
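
The growing delay seen with a single forwarding thread, and its disappearance with eight, can be reproduced with a toy queueing model. The sketch below is a simplification under stated assumptions: packets arrive at a fixed interval, every packet takes the same number of cycles to forward, and threads are treated as fully independent workers, which ignores the synchronization cost and shared execution resources mentioned above. The cycle counts are made up for illustration.

```python
import heapq
from typing import List


def forwarder_wait_times(
    num_packets: int, arrival_interval: int, service_time: int, threads: int
) -> List[int]:
    """Return how long each packet waits before a forwarding thread takes it."""
    free_at = [0] * threads  # min-heap of the cycle at which each thread is free
    heapq.heapify(free_at)
    waits = []
    for i in range(num_packets):
        arrival = i * arrival_interval
        thread_free = heapq.heappop(free_at)  # earliest available thread
        start = max(arrival, thread_free)
        waits.append(start - arrival)
        heapq.heappush(free_at, start + service_time)
    return waits


# Packets arrive every 10 cycles; forwarding one packet takes 40 cycles.
one_thread = forwarder_wait_times(6, 10, 40, threads=1)
eight_threads = forwarder_wait_times(6, 10, 40, threads=8)
```

With one thread the wait grows by 30 cycles per packet (0, 30, 60, ...), the numerical counterpart of the packet arrows slanting further to the right; with eight threads every packet is picked up on arrival.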

OC-48 Case Study
One of the most important considerations in building forwarding plane software is achieving line rate. As described in the previous section, performance is sensitive to a number of factors beyond the functionality of components, and in fact, even beyond the implementation of the algorithms.

In this section, we'll detail how mapping, synthesis, and visualization tools can be used to achieve OC-48 line rate (2.5 Gbit/s) performance with an ingress forwarding application on Intel's IXP2400. Since packet processing is a central part of the system, we'll devote a lot of attention to that topic.

Figure 4 shows an OC-48 ingress application that includes a POS receiver, a classifier, a forwarder, a packet scheduler, a queue manager for managing traffic injected into a CSIX switch fabric, and a CSIX transmit block for actually transferring the packets to the switch fabric.


Figure 4: Diagram showing a composed system, microengine mapping, and memory mapping.

Our first attempt at mapping the system to hardware allocates the microengines in order: one microengine for the receiver; the next four for a group containing the classifier, the forwarder and the queue manager; and the next three for the queue manager, the scheduler, and the CSIX transmitter. In this example, we mapped most of the tables to a block of fast memory and packet data to DRAM. In simulation, this mapping shows sub-line-rate performance, reading data from the network at an average rate of 1.57 Gbit/s and writing it out to the switch fabric at an average rate of 1.83 Gbit/s. In a practical system, this would mean that in bursty traffic, nearly half of the packets might be dropped at the network interface.

In our demo system, the forwarding block appeared to be the major bottleneck. In retrospect, this mapping puts three of the microengines running this functionality in one microengine cluster and only one in the other (the IXP2400 groups microengines into two clusters such that all microengines in a cluster share a bus). This means that the cluster 0 bus is likely to be overworked and to contribute substantially to the poor performance.
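
The imbalance can be checked with a few lines of counting. The sketch below assumes eight microengines numbered 0 to 7, with indices 0 to 3 in cluster 0 and 4 to 7 in cluster 1; that numbering, the group labels, and the helper names are assumptions made for the example. The balanced mapping at the end is purely hypothetical and is not necessarily the arrangement the authors ended up with.

```python
from collections import Counter
from typing import Dict

MICROENGINES_PER_CLUSTER = 4  # assumption: two clusters of four microengines


def cluster_of(microengine: int) -> int:
    """Map a microengine index to its cluster under the assumed numbering."""
    return microengine // MICROENGINES_PER_CLUSTER


def group_per_cluster(mapping: Dict[int, str], group: str) -> Counter:
    """Count how many microengines of one group land in each cluster."""
    return Counter(cluster_of(me) for me, g in mapping.items() if g == group)


# In-order allocation from the text: one receiver microengine, four for the
# group that contains the forwarder, three for the remaining group.
first_attempt = {
    0: "receive",
    1: "forward", 2: "forward", 3: "forward", 4: "forward",
    5: "queue_sched_tx", 6: "queue_sched_tx", 7: "queue_sched_tx",
}
first_counts = group_per_cluster(first_attempt, "forward")

# Hypothetical rearrangement with two forwarding microengines per cluster.
balanced = {
    0: "receive",
    1: "queue_sched_tx",
    2: "forward", 3: "forward", 4: "forward", 5: "forward",
    6: "queue_sched_tx", 7: "queue_sched_tx",
}
balanced_counts = group_per_cluster(balanced, "forward")
```

Under these assumptions the in-order allocation yields three forwarding microengines in cluster 0 and one in cluster 1, matching the split described above, while the hypothetical rearrangement yields two in each.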

With a few adjustments to the mapping (e.g., placing half of the forwarding microengines in each cluster and rearranging the memory access), we achieved an average read rate of 2.42 Gbit/s and an average write rate of 3.32 Gbit/s. These are both adequate to be considered line rate.
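
For a quick sense of scale, the simulated rates can be expressed as fractions of the nominal 2.5 Gbit/s OC-48 figure used in this article. The short script below does only that arithmetic; the function name and dictionary labels are made up for the example.

```python
OC48_LINE_RATE_GBITS = 2.5  # nominal OC-48 rate as quoted in the article


def fraction_of_line_rate(
    rate_gbits: float, line_rate_gbits: float = OC48_LINE_RATE_GBITS
) -> float:
    """Return a measured rate as a fraction of the target line rate."""
    return rate_gbits / line_rate_gbits


# Average rates reported for the two mappings, in Gbit/s.
measurements = {
    "first mapping, read": 1.57,
    "first mapping, write": 1.83,
    "adjusted mapping, read": 2.42,
    "adjusted mapping, write": 3.32,
}

for label, rate in measurements.items():
    share = fraction_of_line_rate(rate)
    print(f"{label}: {rate:.2f} Gbit/s = {share:.1%} of nominal OC-48")
```

The first mapping reads at roughly 63% of the nominal rate, while the adjusted mapping reads at roughly 97% and writes faster than the nominal rate.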

The amount of time it took a developer to experiment with different mapping options to reach line rate was on the order of hours. With hand-written code, the same task could take the developer weeks to complete.

Wrap Up
Network processors are designed to address exceptionally high throughput requirements in networking applications. As such, they are typically composed of an array of PPEs, among which the functionality must be distributed. Component models are used to simplify building the applications, where each component may be mapped to part or all of a PPE, or to multiple PPEs. This adds a level of abstraction between how an application is built and the vendor-provided tools used to analyze it.

Developing high-performance software for network processors is greatly simplified with a software component model. Finding bugs in microcode and making full use of the power available within the network processor require a high degree of visibility into how design decisions affect overall performance. Coordination-centric design decouples the software architecture from the hardware architecture, allowing software developers to quickly evaluate the performance tradeoffs of different mappings. The automatic generation of production-quality line-rate microcode, coupled with a graphical visualization of system-level behavior, provides a powerful environment for the development of increasingly complex network processing applications.

About the Authors
Ken Hines is co-founder, vice president and chief scientist at Consystant. He received his doctorate, master's and bachelor's degrees from the University of Washington. Ken can be reached at ken.hines@consystant.com.

Ross Ortega is co-founder, vice president and CTO at Consystant. He earned Ph.D. and MSCS degrees from the University of Washington and a BSEE degree from MIT. Ross can be reached at ross.ortega@consystant.com.



