Shared-Memory Fabrics Meet 10-Gbit Backplane Demands

Questions are arising in the sector about a crossbar fabric's ability to provide the scalability and cost required in today's high-speed networking equipment designs. With that in mind, designers should consider making the shift to shared-memory fabric architectures.

By Andreas D. Bovopoulos and Micha Zeiger, TeraChip, Inc.
CommsDesign
Apr 23, 2003
 
As engineers look for new ways to enhance and accelerate their packet processing capabilities in networking boxes, attention must be focused on the switch fabric architecture. Traditionally, designers have turned to crossbar fabric approaches in their networking designs. While effective, these fabric architectures require too many components and do not provide the scalability needed to thrive in 10- and 40-Gbit/s box architectures.

What designers need in the sector is a new strategy for their backplane designs. That strategy may emerge in the form of shared-memory fabric architectures that take the queuing and switching functions and centralize them on the switch fabric card.

In this article, we'll compare crossbar architectures with shared-memory architectures. We'll then show the benefits that the shared-memory approach can bring to a networking design.

The Birth of the Single-Stage Fabric
Traditionally, designers have implemented so-called "three-stage" fabric architectures in their networking designs. In these architectures, the first and third stages of switching occur on the line cards while the second stage occurs on a fabric card. Because only one stage of switching takes place in the fabric itself, these so-called three-stage approaches are, in essence, single-stage architectures.

The single-stage fabrics that make up many of today's communication architectures are built around work done by Prof. Nick McKeown of Stanford University. With the Tiny Tera Project (http://tiny-tera.stanford.edu/tiny-tera/), Prof. McKeown demonstrated that a very high-bandwidth switch could be built using single-stage switching and input queues that implement a virtual output queuing architecture. By employing novel scheduling algorithms for both unicast and multicast traffic, Prof. McKeown demonstrated that such an architecture could achieve throughput close to 100%. Within a few years of the publication of his work, single-stage switching architectures attracted much attention, and both systems and semiconductor companies began developing products using this architecture.

As Figure 1 shows, a typical single-stage system consists of line cards communicating with switching cards through a high-speed interconnect, with the additional constraint that the switching card implements a single-stage switching fabric. In this paper, the interface element on the line cards is called the line card interface (LCI). The switching fabric card implements a single-stage switching fabric (SF).


Figure 1: Diagram of a typical single-stage switching architecture.

LCIs typically communicate on the line card with packet processing elements, which, for simplicity, are referred to as the packet-processing module (PPM). The PPM is typically composed of a network processor unit (NPU), which performs the forwarding, classification, and optional tagging functions. In high-end systems the NPU is followed by a specialized traffic manager (TM), which, as its name implies, focuses on streamlining the interfacing to the switching fabric by, among other things, implementing traffic management and segmentation and reassembly.

The LCI-to-SF connections are high-speed (2.5 Gbit/s and higher) differential serial signals that usually carry a proprietary protocol. The LCI-to-PPM interface is standards-based, given that typically the interfacing chips are multi-vendor and interface through the CSIX level 1 (L1) interface or the recently specified Network Processing Forum Streaming Interface (NPSI).

SFs can be built using either crossbar switches or memory-based switching elements. The remainder of this paper discusses and compares the various options for building single-stage switching fabrics.

Bufferless Crossbars
Crossbar-based switching fabrics are characterized by the fact that they utilize memoryless crossbars to interconnect the various cards on the system. The lack of buffering in the SF is compensated for by the introduction of buffering on the line cards and, more specifically, on the LCIs.

Typically, designers use one of three buffering approaches on the line card: output buffering, input buffering, and combined input-output buffering. In all these architectures the simplicity and bufferless nature of the crossbar is compensated for by the use of a scheduler which tightly and centrally manages the movement of traffic between input and output buffers through the crossbar on a time slot-by-time slot basis. Thus, the scheduler tightly couples the operation of the crossbar-based SF and LCI on the line cards.

In output buffered fabrics, packets destined for different outputs are kept in separate output queues. Such switches use a scheduler to control the time at which packets or cells are switched through the switching fabric. With a properly designed scheduler, such a switch can provide QoS guarantees. However, the fabric and the memory of an NxN output buffer switch must operate N times faster than the rate of an input port. This becomes a problem at high line rates since memory with sufficient bandwidth is simply not available.

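A switch operating with input buffering utilizes memory that operates only as fast as the line rate. Because different inputs can compete for the same output, not all inputs with cells to send can transmit at once. A cell that loses this contention waits at the head of its input queue and holds up the cells behind it, even those addressed to idle outputs. This phenomenon is called head-of-line (HOL) blocking.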

Simple analysis suggests that if each input of an NxN crossconnect has a cell with probability p, and if each input cell is independent and is addressed to each output with the same probability 1/N, then for any given output, the probability that i cells are addressed to it is:

$$P(i) = \binom{N}{i} \left(\frac{p}{N}\right)^{i} \left(1 - \frac{p}{N}\right)^{N-i}$$

On average, out of approximately pN input cells, approximately $(1 - e^{-p})N$ can be served. For p=1, approximately 0.63N are served. A more accurate analysis shows that for large N, the maximum crossconnect throughput is 0.58N per cycle.
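The one-cycle estimate above can be checked numerically. The following is a minimal sketch, assuming the same independent, uniformly addressed traffic model; the function names are illustrative and do not come from any product. It counts only the outputs that receive at least one cell in a single cycle, so it reproduces the 0.63N figure and not the 0.58N figure, which requires a queueing analysis.

```python
import math


def prob_cells_for_output(n_ports: int, p: float, i: int) -> float:
    """Probability that exactly i of n_ports inputs hold a cell for one given output.

    Each input has a cell with probability p and picks each output with
    probability 1/n_ports, so the count is Binomial(n_ports, p / n_ports).
    """
    q = p / n_ports
    return math.comb(n_ports, i) * q**i * (1 - q) ** (n_ports - i)


def expected_served(n_ports: int, p: float) -> float:
    """Expected number of outputs that receive at least one cell in a cycle."""
    return n_ports * (1 - prob_cells_for_output(n_ports, p, 0))


if __name__ == "__main__":
    for n in (4, 16, 64, 256):
        exact = expected_served(n, 1.0) / n
        limit = 1 - math.exp(-1.0)
        print(f"N={n:3d}  exact={exact:.3f}N  large-N limit={limit:.3f}N")
```

For small N the exact value sits slightly above the limit, and it approaches $(1 - e^{-1})N$ as N grows.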

Better performance is possible with the use of virtual output queues (VOQ) at the LCIs. As shown in Figure 2, VOQs are integrated into the LCIs. The scheduler receives VOQ status information and decides for the next cell time which ingress cells to transfer to egress ports. The crossbar makes a connection of ingress and egress ports according to the scheduler decision, and each ingress port must send a cell that is destined to the egress port as scheduled.


Figure 2: Typical switch fabric implementation using VOQs for input buffering.
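The interaction between VOQs, scheduler, and crossbar can be sketched in a few lines. This is a toy model only: the class `VoqSwitch` and its single-pass, rotating-priority matcher are assumptions made for illustration, not iSLIP and not the scheduler of any actual fabric. It shows the two properties the text describes: each ingress and egress port is used at most once per cell time, and a cell bound for an idle egress is not held up by a contended one.

```python
from collections import deque


class VoqSwitch:
    """Toy N x N crossbar with one virtual output queue per (ingress, egress) pair."""

    def __init__(self, n_ports: int):
        self.n = n_ports
        # voq[i][j] holds cells waiting at ingress i for egress j.
        self.voq = [[deque() for _ in range(n_ports)] for _ in range(n_ports)]
        self.start = 0  # rotating priority pointer so no ingress is always first

    def enqueue(self, ingress: int, egress: int, cell) -> None:
        self.voq[ingress][egress].append(cell)

    def schedule(self) -> dict:
        """Return {ingress: egress}; each ingress and each egress appears at most once."""
        match = {}
        taken = set()
        for k in range(self.n):
            i = (self.start + k) % self.n
            for m in range(self.n):
                j = (i + m) % self.n
                if j not in taken and self.voq[i][j]:
                    match[i] = j
                    taken.add(j)
                    break
        self.start = (self.start + 1) % self.n
        return match

    def cell_time(self) -> dict:
        """Run one cell time and return {egress: cell} for the cells transferred."""
        match = self.schedule()
        return {j: self.voq[i][j].popleft() for i, j in match.items()}


if __name__ == "__main__":
    sw = VoqSwitch(3)
    sw.enqueue(0, 1, "a")
    sw.enqueue(0, 2, "b")
    sw.enqueue(1, 1, "c")
    print(sw.cell_time())  # ingress 0 wins egress 1; ingress 1 must wait
    print(sw.cell_time())  # ingress 1 gets egress 1; ingress 0 uses idle egress 2
```

A production scheduler would iterate request, grant, and accept phases in hardware; the sketch collapses that into one software pass for readability.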

A sample fabric with 16 ports and 8 classes of service has 128 VOQs per LCI. Each VOQ must hold at least one maximum-size packet. As a result, the total memory in an LCI is at least 200 kbytes. Such a memory increases the LCI die size, which immediately translates into cost. To reduce the memory size, the LCI could build the VOQs out of a shared-memory pool.
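The VOQ memory estimate is simple arithmetic, sketched below. The article does not state the maximum packet size; the 1518-byte value used here (a maximum-size Ethernet frame) is an assumption chosen for illustration, and the function name is hypothetical.

```python
def voq_memory_bytes(ports: int, classes: int, max_packet_bytes: int) -> tuple:
    """Return (number of VOQs per LCI, minimum bytes if each VOQ holds one max-size packet)."""
    voq_count = ports * classes
    return voq_count, voq_count * max_packet_bytes


if __name__ == "__main__":
    # Assumption: 1518-byte maximum packet (not specified in the article).
    count, total = voq_memory_bytes(ports=16, classes=8, max_packet_bytes=1518)
    print(f"{count} VOQs, at least {total} bytes ({total / 1000:.0f} kbytes)")
```

Under that assumption the result is 128 VOQs and a little over 194,000 bytes, in line with the roughly 200 kbytes quoted above. Larger maximum packet sizes scale the figure proportionally.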

Buffered Crossbars
In crossbar-based systems, all the LCIs on the line cards and the crossbars on the switching cards must be synchronized to the centralized scheduler. At high speeds, this becomes a difficult task. Prof. McKeown proved that iSLIP, a typical scheduling algorithm, requires O(log_2 n) iterations to converge. Completing these iterations within each cell time is very difficult for systems supporting a large number of VOQs and line rates of 10 Gbit/s and above.

One way to alleviate this scheduling problem is to introduce FIFOs (that is, dual-ported memory) at each input-output crosspoint of the crossbar. This enhancement is functionally the same as introducing discrete output FIFO queues on every input line.

With these internal FIFO queues, the ingress-scheduling problem is simplified, as cells are sent from an LCI to the crossbar only if the receiving FIFO has space. Thus, on the ingress side of a buffered crossbar, there is no need for scheduling; a simple flow control mechanism suffices.
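The ingress rule described above (send only when the receiving FIFO has space) can be sketched as follows. The names `CrosspointFifo` and `ingress_send` are hypothetical, and the model assumes the LCI sees the FIFO occupancy instantly; a real system learns it through a control signal with a round-trip delay, which is why the FIFOs need headroom.

```python
from collections import deque


class CrosspointFifo:
    """Bounded FIFO sitting at one input-output crosspoint of a buffered crossbar."""

    def __init__(self, capacity_cells: int):
        self.capacity = capacity_cells
        self.cells = deque()

    def has_space(self) -> bool:
        return len(self.cells) < self.capacity

    def push(self, cell) -> None:
        if not self.has_space():
            raise OverflowError("cell sent without honoring flow control")
        self.cells.append(cell)

    def pop(self):
        """Remove and return the oldest cell, or None if the FIFO is empty."""
        return self.cells.popleft() if self.cells else None


def ingress_send(queue: deque, fifo: CrosspointFifo) -> bool:
    """Forward the head cell of an ingress queue only if the crosspoint FIFO has room."""
    if queue and fifo.has_space():
        fifo.push(queue.popleft())
        return True
    return False


if __name__ == "__main__":
    fifo = CrosspointFifo(capacity_cells=2)
    queue = deque(["a", "b", "c"])
    print([ingress_send(queue, fifo) for _ in range(3)])  # third send is held back
    fifo.pop()                                            # egress side drains one cell
    print(ingress_send(queue, fifo))                      # now the held cell can go
```

No matching across ports is needed on the ingress side: each LCI decides locally, per crosspoint, whether it may send.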

On the egress side of the buffered crossbar, the scheduling problem is also simplified. For each output port, the scheduler must schedule traffic among O(n) queues, whereas in a bufferless crossbar the scheduler must schedule traffic among O(n^2) queues.

Nevertheless, the discrete nature of the buffering makes buffered crossbar switches memory intensive. For example, a 16x16 switch would require at least 256 FIFO buffers. The addition of multiple classes of traffic would linearly increase the number of required FIFOs. For example, the support of eight classes of traffic would increase the number of FIFOs to 2K.

In a buffered crossbar, the buffer size of each FIFO queue should be at least twice the amount of traffic that can arrive at a crossbar port over the roundtrip time delay of a control signal exchanged between the LCI and the crossbar on the SF. This roundtrip time delay is on the order of 600 to 800 ns. As a result, for a 10 Gbit/s line rate, the minimum size of each FIFO becomes 12000 to 16000 bits. With 256 queues, the buffered crossbar should support roughly 3 to 4 Mbits of buffering.
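The sizing rule in this paragraph can be worked through directly. The sketch below uses integer arithmetic (bits per second and nanoseconds) to avoid rounding noise; the function names are illustrative.

```python
def min_fifo_bits(line_rate_bps: int, round_trip_ns: int) -> int:
    """Minimum FIFO size: twice the traffic arriving during one control round trip."""
    return 2 * line_rate_bps * round_trip_ns // 10**9


def total_buffer_bits(ports: int, classes: int, line_rate_bps: int, round_trip_ns: int) -> int:
    """Total buffering for one FIFO per crosspoint per traffic class."""
    fifo_count = ports * ports * classes
    return fifo_count * min_fifo_bits(line_rate_bps, round_trip_ns)


if __name__ == "__main__":
    rate = 10_000_000_000  # 10 Gbit/s
    for rtt_ns in (600, 800):
        per_fifo = min_fifo_bits(rate, rtt_ns)
        total = total_buffer_bits(16, 1, rate, rtt_ns)
        print(f"RTT {rtt_ns} ns: {per_fifo} bits per FIFO, {total / 1e6:.2f} Mbit for 256 FIFOs")
```

For a 16x16 switch with a single traffic class this gives about 3.1 Mbit at 600 ns and about 4.1 Mbit at 800 ns. With eight classes the FIFO count rises to 2048 and the total scales by the same factor of eight.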

The Shared-Memory Approach
Under normal traffic conditions, queues on the LCIs are rarely full. By moving the memory from the LCIs into the memory-based SF element, the same performance level can be achieved with less memory.

Figure 3 shows a single-stage switch using memory-based elements to build the SF. In memory-based SF elements, queues are implemented using linked-lists in common memory.


Figure 3: Typical switch fabric implementation using a memory-based fabric.
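Linked-list queues in a common memory can be sketched as follows. This is an illustrative software model, not the design of any particular SF element: the class name, the fixed-size cell slots, and the single free list are all assumptions. The point it shows is that every queue draws slots from the same pool, so memory left idle by one queue is available to another.

```python
class SharedMemoryQueues:
    """Several FIFO queues stored as linked lists inside one shared pool of cell slots."""

    def __init__(self, n_cells: int, n_queues: int):
        if n_cells < 1:
            raise ValueError("the pool needs at least one cell slot")
        self.data = [None] * n_cells
        # next_slot chains the free list at start-up: 0 -> 1 -> ... -> -1 (end).
        self.next_slot = list(range(1, n_cells)) + [-1]
        self.free_head = 0
        self.head = [-1] * n_queues
        self.tail = [-1] * n_queues

    def enqueue(self, queue: int, cell) -> bool:
        """Append a cell to a queue; return False if the shared pool is exhausted."""
        slot = self.free_head
        if slot == -1:
            return False
        self.free_head = self.next_slot[slot]
        self.data[slot] = cell
        self.next_slot[slot] = -1
        if self.tail[queue] == -1:
            self.head[queue] = slot
        else:
            self.next_slot[self.tail[queue]] = slot
        self.tail[queue] = slot
        return True

    def dequeue(self, queue: int):
        """Remove and return the oldest cell of a queue, or None if it is empty."""
        slot = self.head[queue]
        if slot == -1:
            return None
        cell = self.data[slot]
        self.head[queue] = self.next_slot[slot]
        if self.head[queue] == -1:
            self.tail[queue] = -1
        # Return the slot to the free list so any queue can reuse it.
        self.data[slot] = None
        self.next_slot[slot] = self.free_head
        self.free_head = slot
        return cell


if __name__ == "__main__":
    mem = SharedMemoryQueues(n_cells=3, n_queues=2)
    mem.enqueue(0, "a")
    mem.enqueue(1, "b")
    mem.enqueue(0, "c")
    print(mem.enqueue(1, "d"))  # pool is full
    print(mem.dequeue(0))       # frees a slot
    print(mem.enqueue(1, "d"))  # queue 1 reuses the slot queue 0 released
```

Both operations touch a constant number of pointers, which is what makes this organization practical at cell rates.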

The memory on the SF requires a bandwidth equal to twice the external link bandwidth, since every cell is written once and read once. If technology or cost reasons do not allow the use of such a high-speed or such a wide memory bank, multiple memory banks inside the SF element may be used. Compared to the memory requirements of crossbar-based systems, the improved use of memory in a shared-memory-based switch reduces the memory requirements of the fabric by an order of magnitude.

Stacking up Against the Crossbars
So how does the shared-memory approach stack up against the traditional crossbar methods? As discussed in previous sections, bufferless crossbar-based systems suffer from scheduling problems, while buffered crossbars suffer from inefficient use of memory. Both problems affect the cost scalability of such systems. The use of a centralized scheduler introduces both a single point of failure and design complications, especially when system resiliency requires the availability of backup switching fabrics.

Memory-based systems, on the other hand, use a flow control-based in-band control plane to eliminate the need for a centralized scheduler. Such systems are more robust, since the control of traffic is distributed among all the SF elements, which make their own local decisions based on their partial view of the state of the system.

With memory-based switching fabric elements, the objective of the local traffic controller is simplified by the use of fairly easy-to-implement in-band flow control schemes. In such systems, when a particular flow is sent from the traffic manager to the LCI, in-band flow control messages are used to avoid buffer starvation and buffer overflow. In this context, the traffic shaping and flow control functionality of the traffic manager are critical to ensure overall system performance.

In memory-based SF systems with in-band, distributed control, the timing requirements between the switch cards and the LCIs are less stringent than in a scheduler-driven crossbar switch. Since traffic crosses the switching fabric without the involvement of a central scheduler, the LCI can be simply a protocol translator and thus can be simple and small. Consequently, the LCI in a memory-based switch fabric is much simpler than an LCI in a crossbar-based switch fabric. The simplicity of the LCI allows the design of single-chip LCIs even for line rates of 40 Gbit/s.

The streamlining of the design that is achieved through the use of an in-band control plane and shared-memory switching elements has a direct impact on the cost of the system under design. By comparing Tables 1 and 2, designers can see that by utilizing memory-based SF elements, they can significantly reduce the chip count, especially given that the LCI design is drastically simplified. Given that cost and power consumption are tightly coupled to the number of chips used, the use of memory-based SF elements directly reduces the cost and power consumption of the overall system.

About the Authors
Andreas Bovopoulos is director of architecture at TeraChip. Prior to joining TeraChip, Bovopoulos served as founder and CEO of Aetian Networks, a VC-funded optical networking start-up in Fremont, CA. Prior to Aetian, Bovopoulos held positions with PairGain Technologies, 3Com and Chipcom. Before joining Chipcom, Bovopoulos was an assistant professor of Computer Science at Washington University in St. Louis. He received his Ph.D. in electrical engineering from Columbia University and can be reached at andreas@tera-chip.com.

Micha Zeiger is founder, CEO, and CTO of TeraChip. Before founding TeraChip, Zeiger managed an ASIC design house called IC Component Design for seven years. Prior to that, Zeiger designed radar DSP ASICs for Israel's Defense Ministry and navigation systems for the Israel Air Force. He has a BSc in Electrical Engineering from the Technion, Israel Institute of Technology in Haifa, and an MBA from Tel Aviv University. Zeiger can be reached at micha@tera-chip.com.



