Analysis of 10 Gbit/s, high quality-of-service, non-blocking switch fabrics has traditionally been focused on internal fabric architecture issues such as scheduling algorithms, queuing schemes, failover methodologies, and system-level topologies. As switch fabric performance continues to increase, a key bottleneck is often overlooked: the interface between the fabric and its traffic source. This interface must stream traffic efficiently at 10+ Gbit/s with good signal integrity while simplifying PCB design and supporting key fabric "musts" such as per-destination flow control and efficient multicast.
Recently announced work by the Network Processing Forum (NPF) promises to address these requirements. The NPF Streaming Interface (NPSI) leverages two existing 10 Gbit/s standards, SPI-4.2 and CSIX-L1, and fills voids in these standards that have limited fabric performance and interoperability [1]. This article compares the three interfaces to determine which is best suited for high-performance fabrics.
What Fabrics Want
Figure 1 shows a block diagram of a typical switch fabric. In this diagram the fabric interface is represented by a red line.

Figure 1. Block diagram of a typical switch fabric. In this figure, the interface is shown in red.
The fabric interface must address three key requirements to avoid becoming the system performance bottleneck. These include:
- Datapath throughput: Common practice refers to a fabric port by a well-known data rate (10 Gbit/s, OC-48, and so on). However, the fabric interface must provide at least 30 to 60 percent higher throughput. There are two main sources of overhead: the fabric interface header/trailer and the packet processor overhead used for in-band communication between line cards. The fabric interface header/trailer is pure overhead, while the packet processor overhead is treated as traffic by the fabric. Together, these two sources can be 14 bytes or more, a significant percentage when compared to a minimum-size IP packet of 40 bytes. Some interfaces have additional overhead in the form of padding bytes to ensure that each frame is the same width. This padding can require 20 percent additional throughput to compensate. Factoring in all of this overhead turns a "10-Gbit/s" fabric port into at least a 16-Gbit/s interface.
- Flow control throughput: A switch fabric "makes its bones" under congested conditions. Congestion events may cause fabric buffers to fill, requiring flow control message transmission across the fabric interface to halt traffic flow to the congested destinations. In large fabrics, flow control traffic may be substantial. If flow control traffic shares the datapath with regular unicast and multicast traffic, flow control messages can swamp the datapath, resulting in significant throughput degradation. A well-designed fabric interface, therefore, should maintain flow control throughput without degrading datapath throughput under high-congestion conditions.
- Flow control granularity: Today's high-performance fabrics can have thousands of possible destinations, identified by a tuple of {port, subport (a logical division of a port), class of service}. In order to avoid head-of-line (HOL) blocking, each destination must be independently managed within the fabric. If a destination is congested, the fabric will send a flow control message to the traffic sources. The less specific the flow control message, the more "innocent victim" traffic flows may be throttled along with the congested one. To avoid impacting innocent traffic flows, the flow control protocol across the fabric interface must be able to simultaneously scale down to fine granularity and up to thousands of possible destinations. Flow control latency can also relate to granularity; high latency increases the probability of sending lower-granularity messages in order to shut off congested traffic flows, which then impacts innocent victims more often.
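The overhead arithmetic in the datapath throughput item above can be checked with a short calculation. This is a minimal sketch, assuming the worst case of back-to-back 40-byte packets, 14 bytes of combined header/trailer and packet processor overhead, and a 20 percent padding penalty applied on top. The function name and its parameters are illustrative, not part of any of the three standards.

```python
def required_interface_rate(port_rate_gbps, packet_bytes=40,
                            overhead_bytes=14, padding_fraction=0.20):
    """Estimate the interface throughput needed to carry port_rate_gbps of packets.

    Assumes every packet is minimum size and that the padding penalty
    multiplies the header overhead (both are simplifying assumptions).
    """
    header_factor = (packet_bytes + overhead_bytes) / packet_bytes  # 54/40 = 1.35
    padding_factor = 1.0 + padding_fraction                         # 1.20
    return port_rate_gbps * header_factor * padding_factor


rate = required_interface_rate(10.0)
print(f"{rate:.1f} Gbit/s")  # prints: 16.2 Gbit/s
```

With these assumptions the header and trailer alone account for 35 percent, and padding brings the total to roughly 62 percent, which is where the 16-Gbit/s figure comes from.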
Now that we've looked at the key capabilities required by the interface, let's examine how NPSI, SPI-4.2, and CSIX-L1 meet these challenges.
Datapath throughput
The switch fabric interface requirement for datapath throughput can be implemented in several ways. A relatively wide, single-ended bus is the most straightforward implementation. However, for a 10-Gbit/s (actually 16 Gbit/s) fabric interface, the datapath will be extremely wide and will still be uncomfortably fast for a single-ended interface using standard I/O technology.
Figure 2c shows the narrowest allowed CSIX-L1 implementation of a 10 Gbit/s datapath: 64 bits in each direction (not counting power, ground, and some out-of-band control signals). To achieve 16 Gbit/s of throughput, a 64-bit datapath must be clocked at 250 MHz. The HSTL [2] signaling specified in CSIX-L1 supports this clock rate, but PCB layout with this wide, fast single-ended design is challenging and creates a difficult signal integrity problem. Furthermore, the CSIX-L1 standard explicitly prohibits connectors on the switch fabric interface. Connectorizing the fabric interface is very useful for development systems and could one day become the norm in certain production systems, for ease of maintenance and hot swapping.
Figure 2: Switch fabric interface comparison showing datapath and flow control widths. CSIX-L1 flow control in (c) is in-band (part of the datapath). The flow control channel for NPSI and SPI-4.2 is labeled STAT in (a) and (b).
NPSI and SPI-4.2 (Figures 2a and 2b) take a different tack. Both use LVDS signaling with a separate clock (this is distinct from serdes links in which the clock is encoded with the data stream). In order to achieve 16 Gbit/s datapath throughput, NPSI and SPI-4.2 must be clocked at 500 MHz DDR, or 1 Gbit/s per LVDS signal pair (one bit).
Because of the LVDS signaling, each bit of the datapath is two wires, but this still results in only 32 wires in each direction compared to the 64 used for CSIX-L1. This is an important system-level design advantage for SPI-4.2 and NPSI. Although twice as fast in clock rate as CSIX-L1 (and DDR to boot), the NPSI/SPI-4.2 LVDS scheme will support a connector on the interface and places far fewer demands on PCB layout due to its differential nature and built-in receiver-side deskewing.
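The two signaling approaches can be compared side by side with a small model. This is a sketch only: the `ParallelBus` class is hypothetical, and the 16-bit LVDS datapath width is inferred from the 32-wire and 1 Gbit/s-per-pair figures above rather than quoted from either specification.

```python
from dataclasses import dataclass


@dataclass
class ParallelBus:
    name: str
    data_bits: int      # logical datapath width in one direction
    clock_mhz: float
    ddr: bool           # True if data is transferred on both clock edges
    differential: bool  # True if each bit uses a wire pair

    def per_bit_rate_mbps(self):
        return self.clock_mhz * (2 if self.ddr else 1)

    def throughput_gbps(self):
        return self.data_bits * self.per_bit_rate_mbps() / 1000.0

    def data_wires(self):
        return self.data_bits * (2 if self.differential else 1)


csix = ParallelBus("CSIX-L1", 64, 250.0, ddr=False, differential=False)
lvds = ParallelBus("NPSI / SPI-4.2", 16, 500.0, ddr=True, differential=True)

for bus in (csix, lvds):
    print(f"{bus.name}: {bus.throughput_gbps():.1f} Gbit/s "
          f"over {bus.data_wires()} data wires")
```

Both configurations reach 16 Gbit/s, but the LVDS bus does so with half the data wires, at the cost of a fourfold higher per-signal rate.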
NPSI and SPI-4.2 deskew uses a built-in training sequence with user-selectable repetition rate and duration. This eliminates phase errors due to PCB traces of unequal lengths and is an important requirement for parallel buses (whether single-ended or differential) at high clock rates. Deskew capabilities of +/-1 bit time are common, allowing for substantially different trace lengths across the fabric interface.
The big difference between SPI-4.2 and NPSI on the data throughput front lies in multicast operation, where SPI-4.2 is less efficient. SPI-4.2 was originally developed as a framer interface, which is a point-to-point application. Thus, SPI-4.2 has no inherent notion of multicast.
Vendors have developed proprietary solutions in which payload bytes are used to create the multicast frame header. However, strictly compliant SPI-4.2 implementations can support multicast only via traffic replication on the fabric interface. For a multicast fanout of N, N copies of the traffic must be sent across the SPI-4.2 interface to achieve the required fanout. This is a very inefficient use of fabric interface bandwidth.
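The cost of replication is easy to quantify. The sketch below assumes that a natively multicast-capable interface carries one copy of the stream regardless of fanout and ignores the few extra header bytes needed to describe the destinations; the function name is illustrative.

```python
def interface_load_gbps(stream_gbps, fanout, native_multicast):
    """Bandwidth consumed on the fabric interface by one multicast stream."""
    copies = 1 if native_multicast else fanout  # replication sends N copies
    return stream_gbps * copies


stream, fanout = 1.0, 8  # example: a 1 Gbit/s stream sent to 8 destinations
replicated = interface_load_gbps(stream, fanout, native_multicast=False)
native = interface_load_gbps(stream, fanout, native_multicast=True)
print(f"replication: {replicated:.1f} Gbit/s, native multicast: {native:.1f} Gbit/s")
```

In this example, replication consumes 8 Gbit/s of a nominally 10 Gbit/s port for a single 1 Gbit/s stream.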
By contrast, NPSI inherently supports multicast, and flow control of multicast traffic is handled independently of unicast traffic of the same class of service. This latter feature can be especially useful for high-priority streaming traffic such as video, where one stream might be sent to multiple destinations. Flow control differences among the three interfaces are discussed in the next section.
Flow Control Throughput
Turning to flow control, additional differences between the three fabric interface specifications emerge. As shown in Figure 2, CSIX-L1 has no out-of-band flow control (status) channel, instead requiring all flow control information to be carried in-band as part of the datapath. As mentioned earlier, the big disadvantage here is that substantial datapath bandwidth is lost to flow control. The advantage of the in-band approach is that no separate flow control bus is needed. However, such a bus is quite narrow, and the datapath throughput it preserves under congested conditions is well worth the extra signals.
NPSI and SPI-4.2 use an out-of-band flow control channel, which provides dedicated flow control bandwidth independent of the datapath. Unfortunately, SPI-4.2 allows two incompatible implementation choices for the flow control channel: LVDS at up to the full datapath clock rate, or LVTTL at 25 percent of the datapath clock rate. If LVTTL is chosen, it can have a dramatic negative impact on flow control performance, as it restricts the flow control bandwidth to approximately 3 percent of the datapath bandwidth. This is adequate for many framer interfaces, but not for switch fabrics. If the LVDS choice is used instead (this is required in the NPSI standard), a 4X improvement in flow control bandwidth may be achieved. NPSI further improves the situation by optionally allowing the flow control bus width to be 4 bits instead of the 2 bits allowed in SPI-4.2.
SPI-4.2 is quite often used as the 10-Gbit/s interface between a SONET/SDH framer and a packet processor. From the packet processor point of view, the electrical similarity between NPSI and SPI-4.2 (assuming an LVDS flow control implementation for SPI-4.2) allows for a dual-purpose fabric/framer interface whose protocol is user-selectable. The fabric interface must run faster than the framer interface to account for overheads as described earlier.
The dual-purpose interface would be set to SPI-4.2 on the framer side, clocked at 311 MHz DDR, and another copy of the interface would be set to NPSI on the fabric side, and clocked at 450 MHz DDR or faster, depending on expected packet processor overheads. The packet processor IC design team now must deliver only one physical interface, albeit with two different protocols. If CSIX-L1 is the fabric interface, this dual-purpose interface benefit cannot be realized.
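The two clock rates quoted above translate into throughput as follows. This sketch assumes the same 16-bit DDR datapath on both sides, a width inferred from the 1 Gbit/s-per-pair figure given earlier; the constant and function names are illustrative.

```python
DATA_BITS = 16  # assumed datapath width for both SPI-4.2 and NPSI


def ddr_throughput_gbps(clock_mhz, data_bits=DATA_BITS):
    """Raw throughput of a DDR parallel bus: two transfers per clock per bit."""
    return clock_mhz * 2 * data_bits / 1000.0


framer = ddr_throughput_gbps(311.0)  # SPI-4.2 on the framer side
fabric = ddr_throughput_gbps(450.0)  # NPSI on the fabric side
headroom = fabric / framer - 1.0

print(f"framer: {framer:.3f} Gbit/s, fabric: {fabric:.1f} Gbit/s, "
      f"headroom: {headroom:.0%}")
```

At 450 MHz the fabric side has roughly 45 percent more raw bandwidth than the framer side, which falls within the 30 to 60 percent overhead range cited at the start of the article.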
Granularity and Latency
Switch fabrics may generate a large number of flow control messages. Message granularity may range from a specific tuple of {port, subport, class-of-service} through a less-specific {port, subport}, to the least specific {port} or {class of service}. The less granular the message, the more "innocent victims" will be flow controlled, which leaves useful datapath bandwidth on the table. Optimal interface efficiency requires fine-grain flow control messages.
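The "innocent victim" effect can be counted directly. The sketch below builds a small, hypothetical fabric of 4 ports, 2 subports per port, and 4 classes of service, then counts how many flows each message granularity throttles when only one destination is actually congested. The matching rule (a `None` field matches anything) is an illustration, not any standard's message encoding.

```python
from itertools import product


def throttled(flows, message):
    """Return the flows matched by a flow control message.

    A message is a (port, subport, cos) tuple; None in a field means 'any'.
    """
    return [f for f in flows
            if all(m is None or m == v for m, v in zip(message, f))]


# Hypothetical fabric: 4 ports x 2 subports x 4 classes = 32 destinations.
flows = list(product(range(4), range(2), range(4)))

messages = {
    "{port, subport, class}": (1, 0, 3),  # the one congested destination
    "{port, subport}": (1, 0, None),
    "{port}": (1, None, None),
    "{class}": (None, None, 3),
}

victims = {label: len(throttled(flows, msg)) - 1
           for label, msg in messages.items()}
for label, count in victims.items():
    print(f"{label}: {count} innocent victims")
```

In this toy fabric, the fully specified message throttles no innocent flows, the {port, subport} message throttles 3, and the {port} and {class} messages throttle 7 each.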
A related issue is flow control latency, which has two aspects: how often flow control messages may be sent, and how long it takes to act upon them once received. If flow control latency is high, there is a higher probability that many traffic flows will have to be throttled each time a message is sent, because the message will be less granular.
Both NPSI and CSIX-L1 support per-destination (also called directed) flow control messages. These messages can be sent immediately upon detection of congestion (or upon relief from congestion in the case of flow control off messages). CSIX-L1 supports 2^20 possible destinations, and NPSI is close behind at 2^16.
In addition to the flow control throughput limitation discussed earlier, CSIX-L1 has an additional subtle issue: turning on flow control (throttling traffic) is easier than turning it off. CSIX-L1 supports a wildcarding scheme in which several ports or classes of service may be flow controlled simultaneously. This is an efficient use of the flow control interface. However, ports or classes of service that have been flow controlled using wildcarding must have flow control deactivated via individual flow control messages. Since flow control will be deactivated, on average, as often as it's activated, CSIX-L1 wildcarding provides only a limited benefit and is difficult to use in practice.
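A message count makes the limited benefit concrete. This sketch models only the behavior described above (one wildcard message to throttle N destinations, N individual messages to release them) and does not represent the CSIX-L1 message format; the function name is hypothetical.

```python
def cycle_messages(n_destinations, wildcard_on):
    """Messages needed to throttle and later release n_destinations."""
    on = 1 if wildcard_on else n_destinations
    off = n_destinations  # release is always per destination in this model
    return on + off


for n in (2, 8, 64):
    with_wildcard = cycle_messages(n, wildcard_on=True)
    without = cycle_messages(n, wildcard_on=False)
    saving = 1.0 - with_wildcard / without
    print(f"N={n}: {with_wildcard} vs {without} messages ({saving:.0%} saved)")
```

Because the release side is never wildcarded, the saving over a full on/off cycle can approach but never reach 50 percent, no matter how many destinations one wildcard covers.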
NPSI avoids this issue by supporting bitmaps for directed flow control and employing a symmetrical means for flow control activation and deactivation. Bitmaps are updated when flow control status changes (regardless of message granularity) and are not required to be refreshed. This is known as "persistent" flow control.
Flow control persistence allows for very efficient use of flow control bandwidth. Deactivation of a low-granularity flow control condition (such as flow control of an entire fabric port) does not require retransmission of the high-granularity directed flow control status that existed before the entire port was throttled. Low frequency, background flow control refreshing may be used to address the possibility that a flow control message may have been sent but never received (or not received correctly).
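Persistence can be illustrated with a small model of the state held at the traffic source. This is a sketch of the behavior described above, not the NPSI bitmap encoding: the class, its two-level (port and destination) structure, and the method names are all assumptions made for illustration.

```python
class PersistentFlowControl:
    """Source-side copy of flow control state, kept until explicitly changed."""

    def __init__(self, ports, classes):
        self.port_xoff = [False] * ports
        self.dest_xoff = [[False] * classes for _ in range(ports)]

    def update_destination(self, port, cos, xoff):
        # Fine-grained (directed) status change for one destination.
        self.dest_xoff[port][cos] = xoff

    def update_port(self, port, xoff):
        # Coarse status change; does not touch the per-destination bits.
        self.port_xoff[port] = xoff

    def may_send(self, port, cos):
        return not (self.port_xoff[port] or self.dest_xoff[port][cos])


fc = PersistentFlowControl(ports=4, classes=4)
fc.update_destination(2, 1, xoff=True)   # one destination is congested
fc.update_port(2, xoff=True)             # then the whole port is throttled
during = [fc.may_send(2, c) for c in range(4)]

fc.update_port(2, xoff=False)            # a single message releases the port
after = [fc.may_send(2, c) for c in range(4)]
print(during, after)
```

After the port-level release, the earlier per-destination throttle on class 1 is still in force without having been resent, which is the bandwidth saving the persistent scheme provides.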
SPI-4.2 uses a very different mechanism to achieve per-destination flow control: a calendar. In this scheme, a calendar wheel rotates through the 2^8 (256) possible SPI-4.2 destinations. When the wheel passes a given destination, any flow control messages related to that destination are sent.
SPI-4.2's calendar scheme has a major disadvantage in that if a destination becomes congested immediately after the wheel has passed it, the traffic source cannot be notified until the wheel comes around again. This increases average flow control latency, and therefore the probability that innocent victim flows will be impacted. The wheel mechanism also makes it difficult for congested destinations to take more flow control bandwidth as necessary. Although the wheel can be weighted to stop at certain destinations more often, it cannot be known in advance which destinations will be congested.
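The latency penalty of a calendar can be estimated with a simple model. The sketch below measures, in calendar slots, how long a destination waits for its next status opportunity if congestion begins just after a given slot has been serviced. It uses a 256-slot calendar, a size assumed here for illustration, and does not model SPI-4.2 status framing details; the function name is hypothetical.

```python
def wait_slots(calendar, destination):
    """Slots until `destination` is next serviced, for every possible start slot.

    `calendar` is the repeating sequence of destinations visited by the wheel;
    `destination` must appear in it at least once.
    """
    n = len(calendar)
    waits = []
    for start in range(n):
        for delay in range(1, n + 1):
            if calendar[(start + delay) % n] == destination:
                waits.append(delay)
                break
    return waits


plain = list(range(256))                 # each destination visited once per turn
plain_waits = wait_slots(plain, 0)
print(max(plain_waits), sum(plain_waits) / len(plain_waits))  # prints: 256 128.5

# Weighted calendar: destination 0 gets a second slot halfway around the wheel.
weighted = list(range(128)) + [0] + list(range(128, 256))
weighted_waits = wait_slots(weighted, 0)
print(max(weighted_waits), sum(weighted_waits) / len(weighted_waits))
```

Giving destination 0 a second slot roughly halves its wait, but every other destination now waits slightly longer, and the weighting only helps if destination 0 turns out to be the congested one.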
Final Thoughts
Looking at both datapath and flow control aspects of each interface, NPSI is the right choice for 10 Gbit/s fabric interfaces. It blends the best datapath and flow control electrical aspects of SPI-4.2 with the fabric-centric protocol of CSIX-L1 and delivers a whole greater than the sum of its parts. In the future, the Network Processing Forum will likely turn its attention to NPSI enhancements, including both lower-speed operation (e.g., OC-48/2.5 Gbit/s) and higher-speed operation (OC-768) building on SPI-5.
References
- The Streaming Interface Implementation Agreement is available from the NPF at http://www.npforum.org/techinfo/HWStreamingIA.pdf. The CSIX-L1 specification is also controlled by the NPF, but is not being actively worked on: http://www.npforum.org/techinfo/CSIX.shtml. The SPI-4.2 Implementation Agreement is available from the OIF at http://www.oiforum.com/public/documents/OIF-SPI4-02.0.pdf.
- The HSTL specification is available from JEDEC at http://www.jedec.org/download/search/jesd8-6.pdf.
About the Author
Phil Brown is a senior member of technical staff at Tau Networks. Phil holds a BA in Engineering from Cambridge University (UK) and can be reached at philb@taunetworks.com.