Demand for high-availability Internet Protocol routers is being spurred by the need to reinforce the resilience of wide-area network edge aggregation devices, which usually serve large volumes of users and therefore can have a far-reaching impact on network service levels. Users require very high uptime for their mission-critical traffic, as well as for emerging real-time applications, such as voice-over-IP, now joining the network.
Systems designers, therefore, are challenged to deliver system uptime equivalent to, or better than, the reliability of the public switched telephone network (PSTN). This requirement for "five-nines" availability (a system that is processing and forwarding packets at least 99.999 percent of the time) is more pressing at the network edge, where redundant user access links are scarce. In the core, redundant systems connected by meshed links can leverage IP's inherent automatic rerouting capabilities in the event of a failure. This is a critical difference between IP and the circuit-switched PSTN.
Part 1 of this article explored several of the design aspects important to improving router availability, or the percentage of time a router is actually processing and forwarding packets. As noted, a high-availability design requires balancing variables such as system cost, complexity and network service-level goals. Now, in Part 2, we will take a closer look at the software design factors in the high-availability equation.
A highly redundant hardware platform coupled with fast software-recovery techniques boosts the availability of an IP router. System availability is determined by two primary measurements: 1) the router's mean time between failures (MTBF), the average time the device operates between one failure and the next; and 2) the router's mean time to recover (MTTR) from an outage, the average time during which the system is not processing and forwarding packets. Availability, expressed as a percentage, is MTBF divided by the sum of MTBF and MTTR, multiplied by 100.
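As a rough illustration of the availability formula, the following Python sketch computes availability from MTBF and MTTR and converts it into expected downtime per year. The function names and the sample figures in the usage note are illustrative assumptions, not measurements from any particular router.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected minutes of outage per year for a given availability fraction."""
    return (1.0 - avail) * MINUTES_PER_YEAR


def max_mttr_hours(mtbf_hours: float, target: float) -> float:
    """Largest MTTR that still meets the target availability.

    Derived by solving target = MTBF / (MTBF + MTTR) for MTTR.
    """
    return mtbf_hours * (1.0 - target) / target
```

For example, `downtime_minutes_per_year(0.99999)` comes to roughly 5.26 minutes, which is the yearly outage budget implied by five-nines availability. With a hypothetical MTBF of 1,000 hours, `max_mttr_hours(1000.0, 0.99999)` shows that recovery would have to complete in about 0.01 hours (roughly 36 seconds) per failure, which is why lowering MTTR receives so much attention.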
Achieving 99.999 percent uptime or higher is quickly becoming a router design goal. A basic requirement for any highly available IP router is a design that includes completely redundant hardware components. Redundancy allows a router that experiences a failure to switch over to a backup component and recover. In addition to a redundant hardware platform, a software architecture that provides fast or seamless recovery when a switchover occurs is required for lowering MTTR.
Recovery Requirements
To reduce MTTR, router software must be able to either pass some amount of router configuration and link-state information to a standby component or recover from another source. The following types of information must be available on the standby to facilitate graceful system recovery:
- Static router-configuration information
- Link-state information
- Protocol-specific information
- Routing database (or the system must be able to reconstruct the routing database using information from peer routers)
- Other dynamic system information (for example, Simple Network Management Protocol sysUptime must be a monotonically increasing value, even after a switchover)
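The sysUpTime item in the list above can be sketched as follows. This is a minimal sketch that assumes the active RP checkpoints the original boot timestamp to the standby and that both RPs share a synchronized clock; the `UptimeCheckpoint` record and its field name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UptimeCheckpoint:
    """Hypothetical record the active RP synchronizes to the standby."""
    boot_epoch_seconds: float  # wall-clock time at which the system first came up


def sys_uptime_ticks(checkpoint: UptimeCheckpoint, now_epoch_seconds: float) -> int:
    """Return sysUpTime in hundredths of a second (SNMP TimeTicks).

    The value is derived from the checkpointed boot time rather than from
    the local RP's own start time, so it keeps increasing across a
    switchover. TimeTicks is a 32-bit counter, so it wraps at 2**32.
    """
    elapsed = max(0.0, now_epoch_seconds - checkpoint.boot_epoch_seconds)
    return int(elapsed * 100) % 2**32
```

Because the standby computes the value from the same checkpointed boot time, a management station polling before and after the switchover sees the counter continue rather than reset to zero.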
Several key characteristics of system behavior must also be preserved to minimize the impact of a route processor (RP) failure:
- Physical-layer connections must continue to operate independently of the RP.
- Layer 2 connections must not time out during the recovery process of the now-active RP.
- Layer 3 routing protocols must not time out and cause routing flaps.
- The software must continue to forward packets using the last known forwarding table on the line cards.
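The last requirement in the list above, forwarding from the last known table, can be pictured with a small sketch. The `LineCard` class and its `rp_reachable` attribute are hypothetical names chosen for illustration; real line cards implement lookup in hardware, not in a Python dictionary.

```python
import ipaddress
from typing import Dict, Optional


class LineCard:
    """Hypothetical line card holding its own copy of the forwarding table."""

    def __init__(self) -> None:
        self.fib: Dict[ipaddress.IPv4Network, str] = {}
        self.rp_reachable = True  # set to False while the RP is failing over

    def install(self, prefix: str, next_hop: str) -> None:
        """Called by the active RP to push a route down to the card."""
        self.fib[ipaddress.ip_network(prefix)] = next_hop

    def lookup(self, destination: str) -> Optional[str]:
        """Longest-prefix match against the local table.

        The lookup never consults the RP, so forwarding continues with the
        last installed entries even when rp_reachable is False.
        """
        addr = ipaddress.ip_address(destination)
        matches = [net for net in self.fib if addr in net]
        if not matches:
            return None
        return self.fib[max(matches, key=lambda net: net.prefixlen)]
```

The design point is simply that the data path has no dependency on the control path: an RP failure changes `rp_reachable` but not the result of `lookup`.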
To help meet the last two goals, the Internet Engineering Task Force (IETF) has built extensions to routing protocols such as the Border Gateway Protocol (BGP) that minimize the duration and reach of an outage associated with a failed RP (see the "Protocol extensions" section below). BGP is a particularly strong candidate for these high-availability protocol extensions, because it is deployed at the network edge.
The edge is the network segment that benefits most from highly available systems, because it is where many access circuits terminate and traffic is aggregated for forwarding across the wide-area network (WAN). In addition, for cost reasons, consumers and small offices commonly run a single circuit to the edge of a service provider network, leaving no alternative path for routing around a failed device.
Data Syncing Between RPs
There are several options for checkpointing, or synchronizing, large volumes of information between active and standby RPs. The design choice, again, requires balancing uptime requirements with cost, complexity, processing load and other considerations. Let's look at three of these synchronization options: message replay, full synchronization and partial synchronization.
Message replay. In this instance, every message or event generated by the active RP is replayed on the standby. The advantage of this type of synchronization is that it allows for deterministic state creation and propagation to the standby, in that the code on the standby is effectively running the same (or nearly the same) software as the active RP.
However, it is difficult to maintain and guarantee synchronization in a system using message replay. Worse, the code path that caused the active RP to fail is likely to crash the standby as well, because the standby follows the same code path when it processes the same message.
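Both properties of message replay, deterministic state and shared failure modes, show up in a toy model. The event format and the deliberately planted divide-by-zero bug below are assumptions made purely for illustration.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, str, int]  # (operation, key, value); a made-up event format


class RouteProcessor:
    """Toy RP whose state is fully determined by the events it has applied."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = {}

    def apply(self, event: Event) -> None:
        op, key, value = event
        if op == "set":
            self.state[key] = value
        elif op == "divide":
            # Deliberate latent bug: a zero value raises ZeroDivisionError.
            self.state[key] = self.state.get(key, 0) // value
        else:
            raise ValueError(f"unknown operation: {op}")


def replay(events: List[Event]) -> RouteProcessor:
    """Build an RP by applying every event in order, as a standby would."""
    rp = RouteProcessor()
    for event in events:
        rp.apply(event)
    return rp
```

Replaying the same event list on two instances yields identical state, which is the appeal of the approach. Appending an event such as `("divide", "metric", 0)` makes both instances fail at the same point, which is the weakness.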
Full synchronization. Here, every piece of data, including megabytes of routing database information, is synchronized to the standby RP. Having a complete database of full-state information available on the standby avoids "black holes" in forwarding during switchover. In this way, full synchronization speeds system recovery and boosts availability.
In the minus column, high data and transaction rates are required for full synchronization. State for particular protocols, such as Transmission Control Protocol (TCP) with its per-connection sequence numbers, must be continually maintained and updated on the standby, a challenge that is difficult to handle in a deterministic manner.
Because of the large amounts of overhead and messaging generated by fully state-synchronized systems, such as those described in the previous two scenarios, these designs do not scale easily. A typical service provider edge system, for example, supports 20,000 Point-to-Point Protocol sessions and 200,000 BGP routes with 600 peers. With this load, continually synchronizing state information generates massive volumes of internal messaging, which is nonlinear in its growth: the overhead generated increases exponentially with the volume of messages. As a result, such systems might require specialized hardware and a very large bus bandwidth.
Partial synchronization. Using this option, the active and standby RPs synchronize selective information: enough to maintain all Layer 1 and Layer 2 sessions, continue forwarding packets and recover the routing database from adjacent nodes. With this design option, there is less data to synchronize, so system consistency is easier to achieve. This is also a less processor-intensive approach, with no specialized hardware required, and is simple to implement. As such, this option scales very well as the number of routers, interfaces and sessions increases.
Note, though, that while switchover is occurring and the routing database is being rebuilt, the system continues to forward packets using the last forwarding database available. So this option carries the potential for black holes.
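A minimal sketch of partial synchronization follows. The `StandbyRP` class, the dictionary-based tables and the sample addresses are illustrative assumptions, not a description of any vendor's implementation.

```python
from typing import Dict, Iterable, Tuple

Route = Tuple[str, str]  # (prefix, next hop)


class StandbyRP:
    """Hypothetical standby that receives only a subset of the active RP's state."""

    def __init__(self, config: Dict[str, str], sessions: Dict[str, str]) -> None:
        # Partial synchronization: only configuration and Layer 1/Layer 2
        # session state are checkpointed from the active RP.
        self.config = dict(config)
        self.sessions = dict(sessions)
        self.rib: Dict[str, str] = {}  # the routing database is NOT synchronized

    def rebuild_rib(self, peer_routes: Iterable[Route]) -> None:
        """Relearn the routing database from adjacent routers after switchover."""
        for prefix, next_hop in peer_routes:
            self.rib[prefix] = next_hop


def stale_entries(line_card_fib: Dict[str, str], rib: Dict[str, str]) -> Dict[str, str]:
    """Return forwarding entries that no longer match the rebuilt routing database.

    Until these entries are corrected, traffic that matches them may be sent
    toward a next hop that is no longer valid (a potential black hole).
    """
    return {p: nh for p, nh in line_card_fib.items() if rib.get(p) != nh}
```

If the line cards still hold an entry for a prefix that peers no longer advertise after the switchover, `stale_entries` reports it. That gap between the old forwarding table and the rebuilt routing database is the window in which black holes can occur.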
Protocol Extensions
While the standby RP is becoming active following an outage, the routing processes might not be fully functional, or there might be a period of time during which packet forwarding is not operational. To prevent the adjacent routers from declaring the failed router out of service and removing it from their routing tables and forwarding databases, the IETF has developed a set of routing protocol extensions. These extensions, when running in both the failed router and its peers, prevent routing flaps when a router is temporarily unavailable to share routing information but continues to forward packets while it recovers. The first protocol extension to become an IETF Internet-Draft is "A Graceful Restart Mechanism for BGP," better known as "BGP restart."
BGP at the edge. One reason that BGP was targeted as one of the first routing protocols to receive high-availability extensions is that it has been designed to carry a very large number of routes, compared with other routing protocols. Convergence following a BGP software failure usually takes longer than with other routing protocols, resulting in an outage of longer duration.
In addition, BGP is typically deployed at the WAN edge, between the domains of different network operators. Because BGP advertises IP routes across multiple domains, the impact of a failed BGP process can propagate across two or more networks rather than being confined to a single domain. This results in additional network ramifications.
With BGP graceful restart enabled on an edge device and its peers, the data plane can continue to process and forward packets even if the control plane, which is responsible for determining best paths, fails. By also reducing routing flaps, graceful restart stabilizes the network and reduces the consumption of control plane resources.
High-availability extensions like BGP graceful restart are also in development for other routing protocols, such as Intermediate System to Intermediate System (IS-IS) and Open Shortest Path First (OSPF).
How Graceful Restart Works
The software extensions must be deployed on the router that has experienced a failure, as well as that router's BGP peers. The peers help the system regain lost routing information and also help isolate failures from the rest of the network. The peers isolate failures by holding off propagating new network information for a short period of time while the router with a failed RP (hardware or software) recovers.
Graceful restart begins when the initial BGP connection between the edge router and its peers is established (Figure 1). The restarting router and peers signal to one another that they understand BGP graceful restart in their initial exchange of BGP "Open" messages. At that time, the edge router also provides its peers with a list of IP-based protocols for which it can maintain forwarding state across a BGP restart, for example IPv4, IPv6, IP multicast and Multiprotocol Label Switching (MPLS).
When the router recovers its BGP software, the TCP connection to the peer router is often cleared. Normally, this would cause the peer router to clear all routes associated with the restarting router. However, with BGP graceful restart enabled, the peer router marks all routes learned from the restarting router as "stale," but continues to use them to forward packets based on the expectation that the restarting router will re-establish the BGP session shortly. Likewise, the restarting router continues forwarding packets.
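The peer's behavior when the session drops can be sketched as follows. The `PeerView` class is a hypothetical simplification that tracks the routes learned from a single neighbor and ignores timers.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RouteEntry:
    next_hop: str
    stale: bool = False


class PeerView:
    """Routes a peer has learned from one neighbor (illustrative only)."""

    def __init__(self) -> None:
        self.routes: Dict[str, RouteEntry] = {}

    def learn(self, prefix: str, next_hop: str) -> None:
        """Install or refresh a route; a refreshed route is no longer stale."""
        self.routes[prefix] = RouteEntry(next_hop)

    def on_session_down(self, graceful_restart: bool) -> None:
        """React to the loss of the TCP connection to the neighbor."""
        if graceful_restart:
            # Keep the routes for forwarding, but flag them as stale.
            for entry in self.routes.values():
                entry.stale = True
        else:
            # Without graceful restart, every route from the neighbor is removed.
            self.routes.clear()

    def usable_prefixes(self) -> List[str]:
        """Prefixes still available for forwarding, stale or not."""
        return sorted(self.routes)
```

With graceful restart, `usable_prefixes()` returns the same list before and after the session drops. Without it, the list becomes empty and the peer would begin withdrawing those routes from the rest of the network.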
When the failed router opens the new BGP session, it again advertises BGP graceful restart support to its peers. This time, however, it sets a restart flag to let the peer router know that its BGP software has restarted.
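To make the signaling concrete, the sketch below encodes the graceful restart capability using the field layout later standardized in RFC 4724 (capability code 64, a restart-state bit, a 12-bit restart time, and one entry per address family). The draft this article describes may differ in detail, and the function name and sample values are assumptions for illustration.

```python
import struct
from typing import Iterable, Tuple

GR_CAPABILITY_CODE = 64      # Graceful Restart capability code (RFC 4724)
RESTART_STATE_BIT = 0x8000   # "R" bit: top bit of the 16-bit flags/time field
FORWARDING_STATE_BIT = 0x80  # "F" bit: forwarding state was preserved for a family


def encode_gr_capability(restarted: bool, restart_time: int,
                         families: Iterable[Tuple[int, int, bool]]) -> bytes:
    """Encode the capability as code, length, value.

    families holds (AFI, SAFI, forwarding_preserved) tuples, for example
    (1, 1, True) for IPv4 unicast with forwarding state preserved.
    """
    if not 0 <= restart_time < 4096:
        raise ValueError("restart time must fit in 12 bits")
    value = struct.pack("!H", (RESTART_STATE_BIT if restarted else 0) | restart_time)
    for afi, safi, preserved in families:
        value += struct.pack("!HBB", afi, safi,
                             FORWARDING_STATE_BIT if preserved else 0)
    return struct.pack("!BB", GR_CAPABILITY_CODE, len(value)) + value
```

On the initial session the router would send the capability with `restarted=False`. After a restart it sends the same capability with `restarted=True`, which changes only the top bit of the flags field and tells the peer to keep the stale routes while the session is rebuilt.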
While continuing to forward packets, the peer router will refresh the restarting router with any relevant BGP routing information base (RIB) updates. The peer signals that it has finished sending the updates with an "End-of-RIB" (EOR) marker. This is actually just an empty BGP "Update" message.
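For IPv4 unicast, the End-of-RIB marker is the smallest possible BGP Update: a 19-byte header followed by two zero-length fields, 23 bytes in total. The sketch below builds and recognizes that message; the function names are illustrative. For other address families, RFC 4724 specifies an Update carrying only an empty MP_UNREACH_NLRI attribute, which this sketch does not cover.

```python
import struct

BGP_MARKER = b"\xff" * 16  # 16-byte all-ones marker that starts every BGP message
BGP_TYPE_UPDATE = 2        # message type code for Update


def build_ipv4_end_of_rib() -> bytes:
    """Build an Update with no withdrawn routes, no path attributes and no NLRI."""
    # Withdrawn routes length = 0, total path attribute length = 0.
    body = struct.pack("!HH", 0, 0)
    # Header = marker (16) + length field (2) + type field (1).
    length = len(BGP_MARKER) + 2 + 1 + len(body)
    return BGP_MARKER + struct.pack("!HB", length, BGP_TYPE_UPDATE) + body


def is_ipv4_end_of_rib(message: bytes) -> bool:
    """True if the message is exactly the empty IPv4 unicast Update."""
    return message == build_ipv4_end_of_rib()
```

Because the marker reuses the ordinary Update format, a peer that does not look for it simply sees an Update that changes nothing.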
EOR markers help speed network convergence, because once the restarting router has received the markers from all peers, it knows it can begin best-path selection again using the new routing information. Similarly, the restarting router then sends any updates to its peer routers and uses the EOR marker to indicate completion of the process. Without the EOR marker, the restarting router would not know when the update was complete and, as a result, might wait longer than necessary to return to normal operations.
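The restarting router's side of this exchange can be sketched as a simple bookkeeping problem: track which peers still owe an EOR marker and start best-path selection when none remain. The `RestartingRouter` class is a hypothetical simplification; a practical implementation would presumably also bound the wait with a timer in case a peer never sends its marker.

```python
from typing import Dict, Iterable, Set


class RestartingRouter:
    """Tracks route refreshes from peers after a restart (illustrative only)."""

    def __init__(self, peers: Iterable[str]) -> None:
        self.pending_eor: Set[str] = set(peers)
        self.received: Dict[str, Dict[str, str]] = {p: {} for p in self.pending_eor}
        self.ready_for_best_path = False

    def on_update(self, peer: str, prefix: str, next_hop: str) -> None:
        """Record a route re-advertised by a peer."""
        self.received[peer][prefix] = next_hop

    def on_end_of_rib(self, peer: str) -> bool:
        """Note that a peer has finished; return True once all peers have."""
        self.pending_eor.discard(peer)
        if not self.pending_eor:
            # Every peer has sent its marker, so the routing information is
            # complete and best-path selection can safely begin.
            self.ready_for_best_path = True
        return self.ready_for_best_path
```

Without the marker, the only way to decide that a peer has finished would be to wait for a fixed interval, which is the longer-than-necessary delay the text describes.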
In systems that separate the control plane from the data plane, one set of functions can be swapped out while the other continues as usual. Separating these functions must be done in a way that enables customers to load new line cards with images and configurations into the router while the system continues to forward packets based on the best-path information it currently has.
By combining these software upgrade/downgrade techniques with the switchover capabilities for unplanned outages, a seamless planned software change can be achieved with minimal impact to network service levels.