Caught between the promise of near-infinite bandwidth in the optical core and the ever-increasing speeds and port densities of the access infrastructure, today's edge switching systems are running out of steam. The traffic they have to aggregate and route onto the Internet backbone is doubling every 3 to 6 months, is transported across
multiple protocols, and is growing in complexity as new services proliferate. With this level of functionality required, more bottlenecks are occurring in access networks and
more headaches are arising in today's switching system architectures.
In addition to bottlenecks, today's edge access switch designers must also tackle quality of service (QoS) issues. With more data flowing through a system at faster rates,
designers must develop switching architectures that distribute data properly through the system so that service is not lost.
To solve bottlenecks and minimize QoS concerns, designers of edge access switching systems must rethink how information travels through a system design. Reevaluating
switch fabric architectures is a good place to start.
A traditional view
A typical edge switching system design consists of a number of network processors interconnected by a switch fabric (see Figure 1). Traffic typically enters the switch
via an ingress processor, traverses the fabric, and exits from an egress processor.
Packets or cells arriving at the ingress ports are inspected by the processors to determine: (1) their intended destination; (2) the QoS they are to receive based on the traffic
type or a service-level agreement (SLA); and (3) any local modification they may need, such as encapsulation, time to live (TTL) modification, or encryption/decryption.
The QoS required for a particular flow of packets determines when each packet needs to be transmitted from the egress network processor. Ideally, the packets would be
transported from the ingress to egress processor without any incremental delay, and get queued at the egress processor until the traffic shaping algorithms determine the
appropriate time to forward them into the next segment of the network. This is termed the output-queuing model since queuing occurs only at the outputs of the switch.
An implication of this model is that all incoming packets must be delivered to their intended egress network processor without any delay at the ingress network processor,
even if all the incoming packets are intended for the same egress network processor. Each egress processor must accept packets from the fabric not just at the port line rate,
but also at the full aggregate bandwidth of the switch.
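The bandwidth implication of output queuing can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name and the port count and line rate in the example are assumptions, not figures from a particular product.

```python
def egress_fabric_rate_gbps(num_ports: int, line_rate_gbps: float) -> float:
    """Worst-case rate at which ONE egress processor must accept traffic
    from the fabric under pure output queuing, i.e. when every ingress
    port sends to the same egress port at full line rate."""
    return num_ports * line_rate_gbps


# Hypothetical switch: 16 ports, each running at 2.5 Gbps.
ports, line_rate = 16, 2.5
print("port line rate      :", line_rate, "Gbps")
print("required fabric rate:", egress_fabric_rate_gbps(ports, line_rate), "Gbps")
```

The required rate grows linearly with the port count, which is why pure output queuing becomes harder to sustain as switches get larger.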
Sharing the load
The current generation of edge access switch fabrics employs a switching technique based on a shared-memory architecture. Developed when the capabilities of bus- and
ring-based switching architectures were exceeded, shared-memory switches contain a global memory that each line card can write to or read from. This
implementation fits the output-queuing model because all packets queued in the switch are accessible to any egress network processor as if they were in the processor's own
local memory.
The performance of such shared-memory devices is scaled by increasing the bandwidth of the global memory housed in the system. Consequently, the scalability depends largely
on the ability of the semiconductor industry to continue to create faster memory while still maintaining bus widths with reasonable IC pin counts.
However, memory performance, which doubles roughly every 18 months, is not keeping pace with the growing bandwidth demands on the edge network. Each generation of
edge switch - there is a new one approximately every 18 months - must boost its capacity by a factor of four, while memory solutions typically improve performance only by a factor
of two, not four. Some of the difference can be made up by using memory devices with a wider array, but once the memory width reaches the size of the cells or packets being
transmitted, increased width is no longer helpful. Also, pin counts go up as buses get wider, to the point where packaging and layout become impractical. As a result,
shared-memory switch fabrics currently won't scale beyond 20 Gbps of total line-end bandwidth.
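To see how quickly the gap opens, consider a rough sketch that assumes switch capacity must grow fourfold per 18-month generation while memory performance only doubles over the same period. The function name and default factors are illustrative assumptions.

```python
def shortfall_after(generations: int,
                    demand_factor: float = 4.0,
                    memory_factor: float = 2.0) -> float:
    """Ratio of required switch capacity to available memory performance
    after a number of product generations, starting from parity."""
    return (demand_factor / memory_factor) ** generations


# Each generation is assumed to last roughly 18 months.
for g in range(1, 4):
    print(f"after {g} generation(s): demand exceeds memory by {shortfall_after(g):.0f}x")
```

Under these assumptions the shortfall itself doubles every generation, so wider memories can only postpone the problem.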
Alternative lifestyles
To compensate for the memory headaches encountered in today's edge switching system architectures, designers are turning to new models for their switch fabric
architectures. One of the more popular is the input-queuing model.
The input-queuing model eliminates the need for each egress network processor to gain access to all packets the moment they arrive. Instead, the fabric only provides a
transport between the ingress and egress network processors, allowing the processors to deliver packets a little faster than the port line rate.
As with the output-queuing model, there are a number of different implementations of input queuing. The two most often encountered, however, are the multistage interconnect
network (MIN) and the crossbar.
The MIN is essentially a structured network of smaller switches with a well-defined routing algorithm collapsed into a single fabric. However, these MINs can create almost as
many problems as they solve. Because there are multiple paths, multiple arbitration decisions, and multiple queuing stages to deal with (significantly more than the two
implied by the input-queuing model), delivering guaranteed QoS levels cost-effectively becomes extremely difficult. Since data goes through multiple routes to get from an
input port to an output port, there can be serious latency problems that will make it difficult or impossible to handle time-sensitive traffic such as voice and streaming video
efficiently.
The MIN architecture is also expensive to implement because so many of the switch interconnects get used up internally. In a MIN architecture, about 20% of the interconnect
is available for connecting line ends, while 80% is used for moving data around internally. Since the cost of the silicon is closely related to the amount of I/O it provides, this
results in a much higher cost, for a given bandwidth, over a single-stage switch.
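The cost penalty follows directly from the 20/80 split. As a rough sketch (the function name and the 100 Gbps example are assumptions made for illustration), the total interconnect a MIN must provide for a given line-end bandwidth is:

```python
def total_interconnect_gbps(line_end_gbps: float,
                            usable_fraction: float = 0.2) -> float:
    """Total interconnect bandwidth needed when only `usable_fraction`
    of it is available for connecting line ends."""
    if not 0.0 < usable_fraction <= 1.0:
        raise ValueError("usable_fraction must be in (0, 1]")
    return line_end_gbps / usable_fraction


# Hypothetical target: 100 Gbps of line-end bandwidth.
needed = total_interconnect_gbps(100.0)
print(f"interconnect to provision: {needed:.0f} Gbps "
      f"({needed / 100.0:.0f}x the line-end bandwidth)")
```

With silicon cost tracking I/O, a fabric that needs about five times the I/O for the same line-end bandwidth carries a correspondingly higher cost.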
MINs do have a place in the switch fabric hierarchy and will continue to play a role as long as global bandwidth demand outstrips technology improvements. MINs can scale up
into tens or hundreds of terabits of bandwidth, so they are being used today to build multi-terabit switches for the carrier core.
It only makes sense to use MINs when the switch manufacturer is trying to support aggregate bandwidth that is higher than what can be delivered through single-stage switch
fabrics - without resorting to exotic and costly non-CMOS technologies.
The crossbar approach
Crossbar fabrics have long been recognized as potentially providing the best architecture for single-stage, high-bandwidth switches. These fabrics use space-division
multiplexing (SDM) to create a switching medium with a high degree of parallelism. Any data path need only sustain the bandwidth of a single switch port, so the aggregate
bandwidth of the crossbar fabric can be orders of magnitude higher than shared-memory or other single-source switching fabrics that use time-division multiplexing (TDM).
Latency can be low for crossbar switching, and it actually goes down as bandwidth goes up. Crossbar switching fabrics can scale from a few gigabits per second into the terabit
range.
To be successful, crossbar switch fabrics must transport data derived from a range of network types, including variable-length IP packets, ATM cells, and TDM byte streams.
In order to best manage the QoS, all these data types should be transported in optimally sized fixed-length fabric cells. Although this implies a need for segmentation and
reassembly (SAR), in practice it is a small cost to pay for the degree of QoS management it enables.
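A minimal sketch of segmentation and reassembly is shown here. The 64-byte cell payload, the `Cell` fields, and the function names are all assumptions made for illustration; real fabrics define their own cell formats and headers.

```python
from typing import List, NamedTuple


class Cell(NamedTuple):
    seq: int        # position of this cell within the original packet
    last: bool      # True for the final cell of the packet
    length: int     # number of valid payload bytes (the rest is padding)
    payload: bytes  # always exactly `payload_size` bytes


def segment(packet: bytes, payload_size: int = 64) -> List[Cell]:
    """Split a variable-length packet into fixed-length fabric cells."""
    chunks = [packet[i:i + payload_size]
              for i in range(0, len(packet), payload_size)] or [b""]
    return [Cell(seq, seq == len(chunks) - 1, len(chunk),
                 chunk.ljust(payload_size, b"\x00"))  # zero-pad the tail
            for seq, chunk in enumerate(chunks)]


def reassemble(cells: List[Cell]) -> bytes:
    """Rebuild the original packet, dropping the padding."""
    ordered = sorted(cells, key=lambda c: c.seq)
    return b"".join(c.payload[:c.length] for c in ordered)


packet = bytes(range(150))            # a 150-byte example packet
cells = segment(packet)
print(len(cells), "cells of", len(cells[0].payload), "bytes each")
print("round trip ok:", reassemble(cells) == packet)
```

The padding in the final cell is the main overhead of SAR; choosing the cell size is a trade-off between that overhead and the scheduling granularity the arbiter can achieve.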
In a typical crossbar fabric, cells are queued on the input side of the switch fabric. The state of all the input queues is visible to the crossbar arbiter. On the basis of these
states, knowledge of the QoS required for each flow, and feedback from the egress network processors about the states of the output queues, the arbiter decides which connection
to make in the memory-less crossbar and thus determines the order in which cells get forwarded to their respective egress network processors.
In order to give the arbitration algorithms the greatest freedom and flexibility to manage the QoS and to maximize the efficiency of the fabric, the cells in the input queues are
presorted on the basis of destination address and class; cells requiring broadly similar QoS are placed in virtual output queues (VOQs). The QoS-aware arbitration algorithm
can then ensure that the output queues in the egress network processor are never starved of cells which may already be waiting in the input queues.
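The interplay of virtual output queues and a central arbiter can be sketched in a few lines. This is a simplified illustration, not a description of any vendor's algorithm: the class name, the two traffic classes, and the greedy "highest class first, then longest queue" rule are assumptions chosen for brevity.

```python
from collections import deque


class VOQSwitch:
    """Input-queued crossbar with one queue per (input, output, class)."""

    def __init__(self, num_ports, num_classes=2):
        self.n = num_ports
        self.classes = num_classes  # class 0 is the highest priority
        self.voq = [[[deque() for _ in range(num_classes)]
                     for _ in range(num_ports)]
                    for _ in range(num_ports)]

    def enqueue(self, inp, out, cls, cell):
        self.voq[inp][out][cls].append(cell)

    def arbitrate(self, output_ready=None):
        """Decide the crossbar connections for one cell time.

        `output_ready` models feedback from the egress processors; an
        output marked False receives nothing in this cell time. Returns
        (input, output, cell) tuples in which each input and each output
        appears at most once.
        """
        ready = output_ready if output_ready is not None else [True] * self.n
        requests = []
        for i in range(self.n):
            for o in range(self.n):
                if not ready[o]:
                    continue
                for c in range(self.classes):
                    q = self.voq[i][o][c]
                    if q:  # request on behalf of the best non-empty class
                        requests.append((c, -len(q), i, o))
                        break
        requests.sort()  # highest class first, then longest queue
        used_in, used_out, grants = set(), set(), []
        for c, _, i, o in requests:
            if i in used_in or o in used_out:
                continue
            used_in.add(i)
            used_out.add(o)
            grants.append((i, o, self.voq[i][o][c].popleft()))
        return grants


sw = VOQSwitch(num_ports=2)
sw.enqueue(0, 1, 1, "a")   # input 0 -> output 1, low priority
sw.enqueue(1, 1, 0, "b")   # input 1 -> output 1, high priority
sw.enqueue(0, 0, 1, "c")   # input 0 -> output 0, low priority
print(sw.arbitrate())
print(sw.arbitrate())
```

In the first cell time, input 0 loses output 1 to the higher-priority cell from input 1, yet it still forwards a cell to output 0. That is the benefit of sorting by destination: a blocked cell does not hold up cells bound for other outputs.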
The intelligence test
Despite their advantages, crossbar switch fabrics still fall short of answering all the demands on today's system architects. Intelligence is one area that is problematic.
To deliver stronger edge switching solutions, designers not only need fabrics that effectively distribute data around a system, they need fabrics that can intelligently move data
throughout a system architecture. Traditional crossbar products fall short in delivering higher levels of intelligence in the edge switching architecture to improve QoS.
Intelligence is key in the edge network, because it represents the last opportunity to shape and optimize the traffic before it disappears into the "dumb" core. Edge switches
need to handle multiple protocols - including IP, ATM, frame relay, and TDM - and support cost-differentiated services.
Edge switches are protocol-agnostic, with individual line cards dedicated to specific types of traffic such as 10 Gigabit Ethernet or OC-48 packet over SONET (POS). The
difficulty of the arbitration and scheduling task increases exponentially as more line cards are added.
One solution would be to use distributed arbitration on each line card, but the arbiters must have some way of communicating with one another and coordinating their
switching decisions. This process will inevitably take more time than the required arbitration rate while introducing inefficiencies throughout the switch fabric.
Consequently, QoS efforts will suffer from less-than-optimal switching decisions.
Theoretically, this problem can be mitigated by providing more overspeed in the switch fabric - bandwidth in excess of what the line cards require. In practice, however, the
industry is already pushing the bandwidth envelope to the limit, so using a significant part of the fabric core bandwidth to compensate for fundamental inefficiencies in the
architecture is not a good solution.
Enter the global arbiter
A global arbiter can eliminate a lot of communication overhead and thus reduce latency by maximizing the width of the pipes in the switch fabric. By doing this, the arbiter
allows the crossbar switch fabric to turn into an intelligent switching device (see Figure 2).
In a crossbar switch fabric architecture, the global arbiter balances the QoS requirements of every individual cell in the fabric at wire speed. Because the arbiter has a global
view of the traffic, there is no need to waste core bandwidth on arbitration guesswork. Such a global arbiter can use the crossbar resources at better than 97% efficiency.
QoS, which is traditionally based on output queuing, can be delivered through an input-queuing model with a global arbiter. The best arbitration chips can look at all the
potential simultaneous flows - 1024 in a 32-port switch, with multiple traffic classifications to be dealt with on each of these I/O port combinations - and make switching
decisions once every 20 to 30 ns. This guarantees that QoS can be delivered across the entire switch, with time-sensitive traffic such as voice and video receiving guaranteed
bandwidth and bounded latency.
Such QoS capabilities also enable service providers to deliver metered bandwidth. With a global arbiter, providers can segment traffic and charge for it at a granular level. For
example, a building local exchange carrier (BLEC) can move a switch into a large office building and use it to provision various types and levels of service to the different
tenants. The intelligent crossbar switch fabric can track who is using the bandwidth and for what purpose, and make switching decisions that fulfill the terms of sophisticated
service-level agreements.
But to achieve this level of functionality, the global arbiter must make fast decisions. Thus, it has to make one complete solution of the arbitration problem during every cell
period, which is only 20 to 30 ns in a typical OC-192-port switch with a reasonable degree of overspeed. This presents a daunting technical challenge. Fortunately, switching
IC manufacturers are stepping up to the plate and can now deliver this level of functionality.
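The size of the arbitration problem can be checked with back-of-the-envelope arithmetic. In the sketch below, the OC-192 line rate of 9.95328 Gbps is the standard SONET figure, but the 64-byte cell size and the 2x overspeed are assumptions picked for illustration; actual fabrics may use other values.

```python
OC192_LINE_RATE_BPS = 9.95328e9  # SONET OC-192 line rate


def port_pairs(num_ports: int) -> int:
    """Number of input/output port combinations the arbiter must consider."""
    return num_ports * num_ports


def cell_period_ns(cell_bytes: int,
                   line_rate_bps: float = OC192_LINE_RATE_BPS,
                   overspeed: float = 1.0) -> float:
    """Time to transfer one cell across a fabric port, in nanoseconds."""
    return cell_bytes * 8 / (line_rate_bps * overspeed) * 1e9


print("port pairs in a 32-port switch:", port_pairs(32))
print(f"cell period, assumed 64-byte cell and 2x overspeed: "
      f"{cell_period_ns(64, overspeed=2.0):.1f} ns")
```

With these assumed values the cell period lands inside the 20 to 30 ns window, and every one of the port pairs, each with several traffic classes, has to be weighed within it.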
A tall order
In addition to bringing intelligence to switching fabrics, designers are also faced with achieving higher levels of integration in their switching architectures. Until recently,
switch manufacturers have built boxes containing two separate switch fabrics - one for TDM/SONET cross connection and one for IP/ATM. This approach, however, adds both
size and complexity to the switch. The switch fabrics have to be managed differently, and the manufacturer must often deal with multiple suppliers.
Given the space, size, and cooling constraints in edge facilities, switch manufacturers need a single, more versatile switch fabric that can handle both TDM/SONET and IP/ATM
traffic. It's a tall order, but a new generation of crossbar switch fabric technology is rising to the challenge. These switch fabrics can aggregate and route TDM, IP, and ATM
traffic simultaneously at hundreds of gigabits per second.
In addition to integrating IP, ATM, and TDM traffic, designers are also looking to integrate other functions on chip. For example, the serializer/deserializer (serdes)
components in today's edge access equipment designs can be replaced with on-chip integrated transceivers. This gives the switch fabric a smaller footprint and
dramatically reduces its power and cooling requirements.
Integrating the serdes, however, creates new challenges for designers building edge access switching systems. These include providing greater drive capability for long
backplane PCB traces, allowing for variable trace lengths, and eliminating the unwanted effects of placing multiple high-speed transceivers close together on the same silicon
die. Attempts to use traditional methods to integrate existing symmetric and asynchronous link architectures are highly suspect, and so far have failed to materialize on the
market.
An asymmetric approach
A new approach has been developed to solve the PCB trace problems being encountered by today's designers. This new approach employs an asymmetric and synchronous
architecture to improve performance in designs employing switch fabric ICs that combine serdes functionality on board.
Traditional serdes have trouble accommodating the basic asymmetry of the switch environment. In a traditional switching system, at one end of the system there aren't
many serial links on the individual line cards. At the other end there are hundreds of serial links coming into a few chips in the fabric. It is at this end where considerations
such as power consumption and die area become critical.
To handle signal flow through the system, designers have traditionally employed a single transceiver that averages the needs of the two sides of the link and provides both ends
with the same capabilities. In a Gigabit Ethernet environment, this symmetrical design limits the number of transceivers that can be employed in the system architecture to
between four and eight.
Unfortunately, today's edge switching architectures require many dozens of transceivers at the fabric side of the link to operate properly in a high-port-count environment.
Therefore, the symmetrical approach, which is typically employed in modern switch fabric architectures, falls far short in meeting the demands of today's switching system
architectures.
A better solution is to use an asymmetrical transceiver design that puts most of the intelligence at one end of the link. This makes the other end far more compact and power
efficient, resulting in dozens of transceivers that can fit on a single chip.
The basis of the asymmetric serial link is the master/slave nature of the phase-locked loops (PLLs) at each end of the link. Unlike traditional serial links that use power
hungry high-speed PLLs at each end of the link, an asymmetric link requires a PLL on only one side of the link.
By employing a special phase measurement and alignment technique, the slave end is synchronized to the master end each time the link is established. Many slave-end serial
links can now be put on the same die using only one PLL for all links at the slave side. This approach greatly reduces power consumption and eliminates the issues associated
with traditional serial links, like injection locking.
Overall, the asymmetric approach looks like a traditional asynchronous link on the master end, since it uses a PLL, but acts like a synchronous link at the slave end. Because
the two ends are not identical, it is called an asymmetric link.
The asymmetric architecture solves many headaches for today's designers. The greatest advancement is the efficient integration of the serdes on chip. By operating in both a
synchronous and asynchronous manner, the serdes can be integrated on the same die as the switch fabric IC, reducing component count and increasing performance in today's
switch fabric architectures.
Marek Piekarski is the manager of systems architecture at Power X Ltd. He received his BSc in electrical and electronic engineering from the University of Manchester and
can be reached at marek.piekarski@powerxnetworks.com.