Layer 3 Switch Design
By Thayumanavan Sridhar
As the line blurs between Layer 2 and Layer 3 switching, the focus has been on fast IP, since the trend is toward Layer 3 switching for IP and Layer 2 switching for other protocols. The key is to perform the most common cases in the hardware and to use software for error conditions and special cases.
In the internetworking world, switches and routers have been deployed for workgroup and enterprise connectivity. In the past, switches mainly operated at Layer 2 (they were extensions of bridges), while routers were clearly Layer 3 devices. Recently, the line has blurred and switches operating at Layer 3 are becoming more popular - these are known, not surprisingly, as Layer 3 switches.
For the purpose of this discussion, Layer 3 switches are superfast routers that do Layer 3 forwarding in hardware. In this article, we will mainly discuss Layer 3 switching in the context of fast IP. The switches run routing protocols, such as open shortest path first (OSPF) or routing information protocol (RIP), to communicate with other Layer 3 switches or routers and to build their routing/forwarding tables. These tables are looked up to determine the route for an incoming packet.
Layer 3 switches are becoming popular due to advances made in the design and development of ASICs for fast routing. However, there are a number of issues to be considered in the system design of such devices, because of the speeds at which these devices should perform. Some of the design issues and techniques have been addressed in the high-speed commercial routers, and this has helped in the natural evolution of Layer 3 switches.
Evolution
Forwarding decisions at Layer 3 are more complicated than at Layer 2. The process at Layer 2 mainly involves learning the media access control (MAC) addresses from the MAC frames and forwarding a frame on to a port if the destination MAC address of that frame corresponds to a station address learned on that port. The MAC frame is not changed - in some cases, to speed up processing, the switch does not even regenerate the cyclic redundancy check (CRC).
At Layer 3, the forwarding decision involves determining if the packet is addressed to the router, decrementing the time-to-live (TTL),
recalculating the checksum, determining the "next hop" to send the packet out to (if the destination is not present on the local interface), etc. The mechanisms vary, depending upon the protocol routed at Layer 3, and involve a scenario more detailed than the Layer 2 case.
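As a rough illustration of the per-packet work this implies for IP, the sketch below decrements the TTL and recomputes the header checksum in Python. The function names are hypothetical, and a hardware implementation would do the equivalent in logic rather than in a loop. The sample header used in the test is a commonly cited 20-byte example, not one taken from this article.

```python
import struct


def ipv4_checksum(header: bytes) -> int:
    """Return the 16-bit one's-complement checksum of an IPv4 header."""
    total = 0
    for (word,) in struct.iter_unpack("!H", header):
        total += word
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def decrement_ttl(header: bytes) -> bytes:
    """Decrement the TTL (offset 8) and rewrite the checksum (offsets 10-11)."""
    hdr = bytearray(header)
    if hdr[8] <= 1:
        # TTL would reach 0: this is a slow-path case for the CPU (ICMP).
        raise ValueError("TTL expired; pass the packet to the CPU")
    hdr[8] -= 1
    hdr[10:12] = b"\x00\x00"  # the checksum field counts as zero while summing
    struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
    return bytes(hdr)
```

A header whose checksum field is already correct sums to zero under `ipv4_checksum`, which gives a simple way to validate the result.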
To perform fast Layer 3 forwarding in hardware, it is necessary to clearly delineate the functions to be performed. This is quite straightforward, but it is impossible to create an economically viable ASIC to deal with all the protocols that current multiprotocol routers can handle. Another point about delineating the functions is that there is always a default (most common) path and a number of variations (including error conditions) that will need to be specified.
To handle these, Layer 3 switch designers must compromise. IP is the most common among all Layer 3 protocols today. So, a number of Layer 3 switches today mainly perform IP switching at the hardware level and forward the other protocols at Layer 2 (i.e. bridge them).
Also, the default path is typically handled in hardware, while the variations are handled in software.
Layer 3 or Layer 2?
The old wisdom was "switch when you can, route when you must." In Layer 3 switches, we can switch at Layer 3 only if the Layer 3 protocol is supported; otherwise we need to switch at Layer 2. We will consider only Layer 3 IP switches in this article. In these switches, to determine if the MAC frame is an IP frame, we need to check the protocol type field in the MAC header (bytes 13 and 14 for Ethernet encapsulation) or in the subnetwork access protocol (SNAP) header (bytes 21 and 22 for 802.3 SNAP encapsulation). Most Layer 3 switches are optimized to do this lookup in hardware so that the decision of whether to switch at Layer 2 or Layer 3 can be made very quickly. In fact, cut-through switches do this lookup while the frame is being received.
Once the determination is done, the following rule is applied:
If (MAC Frame is an IP Frame)
    Pass to IP Switching Hardware
else
    Pass to MAC Switching Logic
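The rule above can be sketched in Python as follows. This is a minimal illustration, not a description of any particular switch: the function names and the returned labels are hypothetical, and the check of the LLC/SNAP bytes before reading the SNAP type field is an added assumption about how an 802.3 frame would be recognized.

```python
ETHERTYPE_IPV4 = 0x0800
MIN_ETHERTYPE = 0x0600  # values below this in bytes 13-14 are an 802.3 length


def is_ip_frame(frame: bytes) -> bool:
    """Decide from fixed offsets whether a MAC frame carries IP."""
    # Bytes 13 and 14 (offsets 12-13) hold the type or the 802.3 length.
    type_or_len = int.from_bytes(frame[12:14], "big")
    if type_or_len >= MIN_ETHERTYPE:
        return type_or_len == ETHERTYPE_IPV4
    # 802.3 with LLC/SNAP: AA AA 03, a zero OUI, then the type in
    # bytes 21 and 22 (offsets 20-21).
    if frame[14:17] == b"\xaa\xaa\x03" and frame[17:20] == b"\x00\x00\x00":
        return int.from_bytes(frame[20:22], "big") == ETHERTYPE_IPV4
    return False


def dispatch(frame: bytes) -> str:
    """Apply the rule: IP frames go to Layer 3, everything else to Layer 2."""
    if is_ip_frame(frame):
        return "ip_switching_hardware"
    return "mac_switching_logic"
```

Because both checks read fixed offsets near the start of the frame, the decision can be made before the whole frame has arrived, which is what makes it suitable for cut-through operation.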
The MAC switching logic is Layer 2 processing based on virtual LANs (VLANs), spanning tree protocol, and MAC address learning. It is well understood, and we will not consider it further.
Within the IP switching logic (actually the IP forwarding logic in earlier routers), there are a number of criteria to consider. These involve filtering and access control, error frames, frames addressed to the router itself, etc. The actual forwarding decision can be taken only after much of this preprocessing. While this yields a logically correct solution, it has its penalties in terms of speed, as well as making it difficult to do in hardware.
Typically, the frames addressed to the IP switch are for the IP stack on the switch, for SNMP, Telnet, HTTP, or ping access. Such frames are not to be forwarded, so they skip the IP switching hardware and are sent to the CPU (read: software) for processing. This decision cannot be made at Layer 2, since frames to be routed and frames for the switch itself all carry the MAC address of the Layer 3 switch as their destination. Such frames are identified by checking the destination IP address of the frame and verifying that it is one of the addresses of the Layer 3 switch. This can be easily implemented in hardware.
A Layer 3 IP switch would typically have its own IP addresses configured into the hardware. The incoming frame's destination IP address is compared with each of these addresses, and if a match is found, the frame is passed to the CPU for further processing.
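A minimal Python sketch of this check follows. The addresses in LOCAL_ADDRESSES are made-up examples, and the function name and return labels are hypothetical. In hardware the comparison would be a small set of parallel comparators rather than a set lookup.

```python
import ipaddress

# Hypothetical addresses configured on the switch's own interfaces.
LOCAL_ADDRESSES = frozenset(
    int(ipaddress.IPv4Address(a)) for a in ("10.0.12.1", "198.75.62.1")
)


def next_stage(ip_header: bytes, local=LOCAL_ADDRESSES) -> str:
    """Send packets addressed to the switch itself to the CPU."""
    # The destination address occupies offsets 16-19 of the IP header.
    dst = int.from_bytes(ip_header[16:20], "big")
    return "cpu" if dst in local else "ip_forwarding"
```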
Filtering and policy decisions
One of the reasons for the power of routers is the extensive filtering capabilities they provide. For example, they can be configured to permit or deny traffic based on source/destination IP address, protocol (TCP, UDP), port number (Telnet, SNMP, etc.), or any combination of these factors. While this does involve some complicated rule definition on the part of the user, it is an extremely useful feature.
Layer 3 IP switches can be designed to perform many of these functions, too. However, they cannot provide for all variables. It is difficult to specify a large number of filter conditions since the decision has to be made in hardware, and there is only a finite number of conditions that can be programmed into the hardware ASIC. The compromise is to specify only a subset of conditions for the filtering and firewall capabilities and process all the frames that match via software. The key is not to change the default. If some frames need to be filtered, it is better to look at the qualifying frames via the CPU instead of building all the logic into the ASIC.
For example, consider traffic between the hosts whose IP addresses are 10.0.12.26 and 198.75.62.12. If we wish to filter out all the HTTP traffic between these two components, the rule would be as follows:
Src: 10.0.12.26 Dest: 198.75.62.12 Protocol: TCP Port: 80 (HTTP)
All frames are passed through a hashing mechanism while being received. The hash is a function of the source and destination IP addresses, protocol, and port numbers. The filter rule has already been passed through the same hashing function to produce a hash code. If the frame's hash code matches (the frame itself may not be an exact match for the rule), the frame is passed to the CPU for further processing. If it turns out that the frame is not to be filtered, it will have incurred a delay because it was processed via the CPU instead of being switched directly through the Layer 3 IP switch fabric. This is an acceptable compromise, which avoids subjecting all IP frames to the complex checks. This technique has been used quite effectively in a number of commercial Ethernet controllers when filtering MAC addresses. Obtaining the fields from the frame is a fairly straightforward task, except when IP options are involved.
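The following Python sketch illustrates the idea under some stated assumptions: CRC-32 stands in for whatever hash a real ASIC would use, the 256-bucket table size is arbitrary, and the class and method names are hypothetical. The rule is treated as one-directional with a single port, which is a simplification of the example rule above.

```python
import ipaddress
import zlib

HASH_BITS = 8  # assumed table size: 256 buckets


def flow_hash(src: str, dst: str, proto: int, port: int) -> int:
    """Hash the filter fields down to a small bucket index."""
    key = (
        ipaddress.IPv4Address(src).packed
        + ipaddress.IPv4Address(dst).packed
        + bytes([proto])
        + port.to_bytes(2, "big")
    )
    return zlib.crc32(key) & ((1 << HASH_BITS) - 1)


class HashFilter:
    def __init__(self):
        self.bitmap = [False] * (1 << HASH_BITS)  # what the hardware consults
        self.rules = set()                         # what the CPU consults

    def add_rule(self, src, dst, proto, port):
        self.rules.add((src, dst, proto, port))
        self.bitmap[flow_hash(src, dst, proto, port)] = True

    def fast_path(self, src, dst, proto, port) -> str:
        """Hardware side: a hash hit only means 'let the CPU look'."""
        hit = self.bitmap[flow_hash(src, dst, proto, port)]
        return "cpu" if hit else "forward"

    def slow_path(self, src, dst, proto, port) -> str:
        """CPU side: exact comparison resolves false matches."""
        return "drop" if (src, dst, proto, port) in self.rules else "forward"
```

A frame that collides with the rule's bucket but does not match the rule is still forwarded by `slow_path`, only later than it would have been through the fabric.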
The same technique can be used for queuing and priority within the switch. This topic is gaining interest as one of the techniques used in weighted fair queuing (see "Considering Gigabit Ethernet Switch Design" by Shirish Sathaye, Communication Systems Design, November 1997, p. 36). Through hashing techniques, a packet can be grouped into a priority class, which can use weighted fair queuing for packet scheduling.
IP options
IP options are used in a number of networks for debugging purposes. As such, they form a very small percentage of the regular IP traffic flowing in today's networks. However, they need to be considered for forwarding decisions, and this section will discuss how Layer 3 switches can handle them.
The IP header has a minimum length of 20 bytes and can be up to 60 bytes long. The extra length is for IP options, the most common ones being source route, record route, and time stamp. Most packets use the 20-byte header; for the others, the header length can be anything between 20 and 60 bytes, the only condition being that it is a multiple of four. It is difficult to build hardware to accommodate this range, so a useful design compromise is to simply assume that all IP packets use 20-byte headers (Figure 1).
We indicated in the previous section on filtering and prioritization that the TCP port numbers are commonly used in the filter rule. These port numbers occupy the first 4 bytes of the TCP header, which follows the IP header. For the hashing function specified earlier, they need to be obtained quickly via a hardware lookup on the frame, which means they must be available at a fixed offset. The TCP header is therefore always assumed to start at byte 21 of the IP packet for these functions. If there is a false match because of IP options, the frames are looked up by the CPU and processed correctly. If there is a false match for prioritization, then such frames take up some bandwidth, but as noted, they form a very small percentage of the regular traffic and so this will not be a serious problem.
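To make the fixed-offset assumption concrete, here is a small Python sketch. The function names are hypothetical. The first function reads the ports the way the fast path would, and the second uses the header length field the way the CPU would when it resolves a false match.

```python
def ports_at_fixed_offset(ip_packet: bytes):
    """Fast path: assume a 20-byte IP header, so TCP starts at byte 21."""
    src_port = int.from_bytes(ip_packet[20:22], "big")
    dst_port = int.from_bytes(ip_packet[22:24], "big")
    return src_port, dst_port


def has_ip_options(ip_packet: bytes) -> bool:
    """The low 4 bits of the first byte give the header length in 32-bit words."""
    return (ip_packet[0] & 0x0F) > 5


def true_ports(ip_packet: bytes):
    """Slow path: honor the real header length, options included."""
    ihl = (ip_packet[0] & 0x0F) * 4
    src_port = int.from_bytes(ip_packet[ihl:ihl + 2], "big")
    dst_port = int.from_bytes(ip_packet[ihl + 2:ihl + 4], "big")
    return src_port, dst_port
```

For a packet without options both functions agree. With options present, the fixed-offset read lands inside the options field and returns whatever bytes happen to be there.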
Error handling and fragmentation
Since Layer 3 IP switches are just routers, they need to handle all error conditions the same as regular routers. For
example, if the switch cannot forward a packet due to an unreachable condition (route to the destination does not exist or is currently down), then it needs to generate an Internet control message protocol (ICMP) destination unreachable message. The same is true for the case when the switch decrements the TTL field in the IP header and it reaches 0. In that case, the switch needs to generate an ICMP time-exceeded error message to the source. It is obvious that all these ICMP messages can be generated by the CPU,
and that there is no need for moving any of these functions to the hardware.
IP packets are fragmented when the outgoing interface's maximum transmission unit (MTU) is less than the size of the IP packet. In the case of Layer 3 switches, they are all connected to LAN segments of the same type (mostly Ethernet). The MTU on all the interfaces is the same, so this fragmentation issue does not apply.
If the Layer 3 switches are also equipped with WAN ports, this becomes an important issue. In such cases, the fragmentation function and subsequent encapsulation (conforming to the WAN protocol used on that interface) will need to be performed outside the switching hardware function. By and large, it is more common to build Layer 3 switches as pure LAN switches without changing the default MTUs on the LAN ports.
Lookups
This is the key to the fast processing of the IP packet and is also an area where there is considerable product differentiation. There are multiple ways to
speed up the processing here, so it helps to have an idea of what is to be done.
The network portion is extracted from the destination IP address of the packet and used for the forwarding lookup. An entry in the forwarding table in a conventional router has the following fields:
Destination Address
Subnet Mask
Next Hop IP Address
When the packet's destination IP address masked with the subnet mask equals the destination address, then there is a match and the next hop IP address is the
address to which the packet is to be forwarded. This may be a directly connected network or the address of a router to which the packet should be forwarded.
The structure presented is the first level information needed to forward a packet. However, it is not sufficient to actually send the packet out on the wire. Consider a case where the next hop is actually another router on one of the Ethernet interfaces of the Layer 3 IP switch. The switch needs to determine which interface the router lies on, along
with its MAC address, before it is able to forward the packet out to the router. This information is available via lookups into the interface table and the address resolution protocol (ARP) cache. We will revisit this in greater detail in a moment.
In a Layer 3 switch, the table can be maintained in fast RAM while the lookup can be done in hardware. The incoming frame, once it has passed the filtering and error tests, has its destination IP address extracted and masked with the subnet mask to check for
the match. Since this is a straightforward operation, it can be easily implemented in hardware. The key issue is determining the right entry. A brute force method would be to try to pass through entries one at a time and determine after masking if the result matches with the destination IP address in the entry. This is a time consuming process.
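The brute force method can be sketched in Python as below. The Route fields mirror the three-field entry described earlier. The helper names are hypothetical, and preferring the entry with the longest mask when several match is an added assumption that the text does not spell out.

```python
import ipaddress
from typing import NamedTuple, Optional


class Route(NamedTuple):
    destination: int  # network address
    mask: int         # subnet mask
    next_hop: str     # next hop IP address


def ip(text: str) -> int:
    """Convert dotted-decimal text to a 32-bit integer."""
    return int(ipaddress.IPv4Address(text))


def linear_lookup(table, dst: int) -> Optional[Route]:
    """Try every entry in turn: mask the destination and compare."""
    best = None
    for route in table:
        if dst & route.mask == route.destination:
            # Assumption: the most specific (longest) mask wins.
            if best is None or route.mask > best.mask:
                best = route
    return best
```

The cost grows with the number of entries, which is why the approaches that follow try to narrow the search before comparing.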
Current implementations tackle this in multiple ways. One approach is to store the entries in multiple hash tables (one method uses separate hash tables, such as one for Class A and one for Class B addresses) so that getting to the appropriate set of entries is faster. Here again, the hashing can be done very quickly in hardware to get to the appropriate entry, as described in the section on filtering. Frequently, optimization of lookups follows the path of "hash then cache," i.e., look up the destination based on a hashing algorithm and then cache the entry if a large number of packets require the use of this entry for forwarding.
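A minimal Python sketch of "hash then cache" follows, under several assumptions: Python dictionaries stand in for the hardware hash tables, a Class C table is added alongside the Class A and Class B tables mentioned above, only classful network routes are stored, and an entry is cached on first use rather than after some traffic threshold. All names are hypothetical.

```python
import ipaddress


def ip(text: str) -> int:
    return int(ipaddress.IPv4Address(text))


def classful_key(addr: int):
    """Return (class, network number) from the leading octet."""
    first = addr >> 24
    if first < 128:
        return "A", addr & 0xFF000000
    if first < 192:
        return "B", addr & 0xFFFF0000
    if first < 224:
        return "C", addr & 0xFFFFFF00
    return "other", addr  # multicast and reserved space: not handled here


class HashThenCache:
    def __init__(self):
        self.tables = {"A": {}, "B": {}, "C": {}}  # one hash table per class
        self.cache = {}                            # destination -> next hop
        self.cache_hits = 0

    def add_route(self, network: str, next_hop: str):
        cls, net = classful_key(ip(network))
        self.tables[cls][net] = next_hop

    def lookup(self, dst: str):
        d = ip(dst)
        if d in self.cache:
            self.cache_hits += 1
            return self.cache[d]
        cls, net = classful_key(d)
        hop = self.tables.get(cls, {}).get(net)
        if hop is not None:
            self.cache[d] = hop  # later packets to d skip the hash step
        return hop
```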
We still need to obtain the next hop
interface and IP to MAC address mapping of the next hop router before we actually send the packet out on the wire. Instead of using the standard method of looking this up in a separate interface table and the ARP cache, implementations simply keep all this information in the forwarding table itself. This avoids processing logic and additional lookups, and allows a hardware-based implementation to obtain all the information needed to forward the frame by just one fast lookup, which helps speed up
implementations (Table 1).
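The consolidation can be pictured with the Python sketch below. The field names are illustrative and are not taken from Table 1, and keying the interface table by next hop address is a simplification. The point is that the control software flattens the three conventional tables once, so that the forwarding path needs a single lookup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForwardingEntry:
    next_hop_ip: str     # from the routing table
    out_port: int        # from the interface table
    next_hop_mac: bytes  # from the ARP cache


def build_combined(routes: dict, interfaces: dict, arp: dict) -> dict:
    """Control plane: merge route, interface, and ARP data per network."""
    combined = {}
    for network, next_hop in routes.items():
        combined[network] = ForwardingEntry(
            next_hop_ip=next_hop,
            out_port=interfaces[next_hop],
            next_hop_mac=arp[next_hop],
        )
    return combined
```

The trade-off is that the combined table has to be rebuilt or patched whenever a route, an interface, or an ARP entry changes, which is work that fits naturally in software.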
Another method is to cache the most frequently accessed information so that a large number of packets benefit. For example, if most of the traffic is between A and B and between C and D, the cache contains information about the routes used for forwarding packets between these nodes. This way, the packets between these nodes can be switched extremely quickly. Another technique used with such cached information is precalculation of the decremented TTL and the resulting checksum. Since the fields in the (20-byte) IP header are already known a priori in these conversations (as long as the packets follow the same end-to-end path), it is easy to precalculate the checksum, new TTL, and also the MAC header for forwarding on the interface. So, when the lookup is completed, all information for forwarding is available and can be used to construct the frame in a speedy fashion.
Statistics and monitoring
Routers provide an extensive amount of statistics as part of their function. This information is
very useful for network managers for monitoring, determining trouble spots in the network, etc. Counters for incoming and outgoing frames, and errored frames are some of the key statistics elements. Correct update of these counters provides useful information for monitoring.
Routers have traditionally incremented these counters in software. Layer 3 switches cannot afford to do this, since the incrementing would have to be done in the main forwarding path and would cause a serious performance penalty. The
compromise is that some of the counters are implemented in hardware on the ASIC itself. The counters are incremented by the hardware while the packet is being processed, avoiding the overhead of a software-based increment. The counters have a tendency to overflow, due to the large number of packets being processed. This can be solved by having more silicon real estate allocated to incorporate larger counters, or by having software step in. The first option results in higher cost and is not preferred, since
the counter functionality is not key to the hardware ASIC.
The second option is an interesting one. We mentioned earlier that the CPU (read: software) handles error conditions and "not so common" cases. It is the same here. The hardware counters count the packets, and software is expected to read the values out within a certain period of time (or within a certain number of packets) to prevent overflows of the counter. Since the number of packets cannot be accurately estimated, worst-case conditions are used. Consider a counter that keeps track of the total number of incoming packets. If the counter can count 64,000 packets, and each port of a four-port Ethernet switch receives 14,880 packets per sec (the maximum 10-Mbps Ethernet arrival rate for 64-byte frames), the aggregate rate is 59,520 packets per sec and the counter can overflow in just over 1 sec. The software therefore has to read this counter at least once every sec to get the current value. Overflows are signaled by the hardware via setting of a bit so that the software knows on a read of the counter that it needs to adjust for the overflow.
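The arithmetic and the software adjustment can be sketched in Python as follows. The figures are the ones used in the example above. The class name and the polling interface are hypothetical, and the sketch assumes the counter wraps at its capacity and that at most one overflow occurs between reads.

```python
PORTS = 4
MAX_PPS_PER_PORT = 14_880   # 64-byte frames on 10-Mbps Ethernet
COUNTER_CAPACITY = 64_000   # counter size from the example


def seconds_to_overflow(capacity=COUNTER_CAPACITY, ports=PORTS,
                        pps=MAX_PPS_PER_PORT) -> float:
    """Worst-case time for the hardware counter to overflow."""
    return capacity / (ports * pps)


class SoftwareCounter:
    """Extends a wrapping hardware counter using its overflow bit."""

    def __init__(self, capacity=COUNTER_CAPACITY):
        self.capacity = capacity
        self.total = 0
        self.last_raw = 0

    def poll(self, raw: int, overflowed: bool) -> int:
        """Fold one hardware reading into the running total."""
        delta = raw - self.last_raw
        if overflowed:
            delta += self.capacity  # account for the single wrap
        self.total += delta
        self.last_raw = raw
        return self.total
```

With these numbers `seconds_to_overflow()` comes to roughly 1.075 sec, which is why a once-per-second read is sufficient only with very little margin.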
Optimization
In this article, we have described some of the issues related to Layer 3 switch design. The focus has been on IP since the trend is to do Layer 3 switching for IP and Layer 2 switching for the other protocols. The key is to perform the most common cases in the hardware and to use software for error conditions and special cases. Optimizations such as caching and hash tables have been used in many routers and are good candidates for Layer 3 switches.
Thayumanavan Sridhar is the director of engineering at Future Communications Software in Santa Clara, CA. He received his BE in electronics and communications engineering from the College of Engineering, Guindy, Anna University, Madras, India, and received his MSEE from the University of Texas at Austin. He can be reached at sridhar@futsoft.com.