High-Availability Systems Made Easy: Part 2
By Chuck Hill
In part one of this article, we looked at system architectures for high availability -- but that is only half the battle. The designer must also consider the software, and especially the software restart model, when developing high-availability architectures.
A true high-availability system is not complete unless all factors capable of causing system outages are considered. And while the mechanisms for modeling and predicting hardware behavior are fairly well understood, some other factors are harder to model and more difficult to predict. The data for these components are typically tracked on an historical basis, since systems do not normally achieve high availability at first deployment. Historical feedback and system refinements are critical to reaching high levels of availability.
Historically, software has been the second-largest factor in system outages, but it does not usually receive the attention that hardware does from a fault-management perspective. Also, software is more frequently changed and upgraded, making it more difficult to predict the long-term impact on system stability. With commercial off-the-shelf systems (COTS) depending on software for fault detection and fault management, software is more critical than ever before.
The chart shown in Figure 1 illustrates another important consideration. Hardware and software failures account for approximately two thirds of the system's outage budget. The remaining one third is made up of factors that are even more difficult to control. System instability due to hardware and software upgrades is a significant factor. Operator errors also cause outages, as do natural disasters and vandalism, and all of these are statistically significant.
Cold, warm, hot
Software has a key role to play in the fault management of a system. The duration of outages, whether caused by hardware or software, is determined by the system's ability to restart service. The strategies employed in the fault-management process vary depending on whether the system uses a hot-restart, a warm-restart, or a cold-restart model. The restart model is determined by the amount of information available to the system at the time of an event - the more information available, the faster the restart.
A hot-restart system has the fastest recovery time, but is the most complex to implement. In a hot-restart model, the application saves state information about the current activity of the system. That information is given to the standby component so that it is ready to take over quickly. Applications must be designed to restart using this state information for the system to work.
A hot-restart system also requires that a standby component is designated prior to a fault-management event. In clustered systems (2N) this is straightforward since there is a one-to-one correspondence between components and their standbys (see Figure 2). In N + 1 systems, hot restart requires the standby device to save the states of multiple components. The standby must have the extra capacity to do this. Otherwise a warm-restart model must be used.
A warm restart is similar to the hot-restart model. In a warm-restart model, the applications save state information about the current activity of the system (see Figure 3). The difference is that the standby component is not designated until the fault-management cycle is in progress. The standby component is then configured with the necessary application and state information. This adds time to the restart process, but can reduce costs associated with the standby components. Warm restart is also easier to implement in systems where the standby devices are not identical to the active devices.
A cold restart is the least complex to implement, but requires the most time. A cold restart implies that the starting place for the standby element is its initialization point. A cold restart is used when no information is available about the state of the failing component. The last known state is therefore initialization.
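The relationship between available information and restart model can be summarized in a few lines. The following is only a sketch of the decision described above; the function name `choose_restart_model` and its two boolean parameters are illustrative assumptions, not part of any real fault-management API.

```python
from enum import Enum


class RestartModel(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


def choose_restart_model(state_saved: bool, standby_predesignated: bool) -> RestartModel:
    """Pick the fastest restart model the available information supports."""
    if not state_saved:
        # No saved state: the standby can only start from initialization.
        return RestartModel.COLD
    if standby_predesignated:
        # A standby chosen before the fault already holds the state.
        return RestartModel.HOT
    # State exists, but a standby must still be selected and configured.
    return RestartModel.WARM
```

In this simplified view, an N + 1 system whose single standby lacks the capacity to hold every component's state would pass `standby_predesignated=False` and fall back to warm restart.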
A cold-restart system can be implemented with little or no changes to the applications of a system. The high-availability-specific software components can be relegated to OS software and services. But the price for this simplicity is much longer restart times and loss of any current activity in the system.
The times associated with the different restart models vary depending on the implementation of the system and application software. In relative terms, if the hot-restart model is 1X, the warm-restart model can be 2 to 3X and the cold-restart model is approximately 10 to 100X. Restart times will be the lowest in systems where the system software and applications are designed to support high availability.
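To make the relative figures concrete, the multipliers can be applied to a measured hot-restart time. This is a minimal sketch: the multipliers come from the paragraph above, while the 2-second hot-restart time in the usage lines is a made-up value chosen purely for illustration.

```python
# Relative restart-time multipliers quoted in the text (hot restart = 1X).
RELATIVE_RESTART = {
    "hot": (1, 1),
    "warm": (2, 3),
    "cold": (10, 100),
}


def restart_time_range(model: str, hot_seconds: float) -> tuple[float, float]:
    """Return the (low, high) restart time in seconds for a restart model."""
    low, high = RELATIVE_RESTART[model]
    return (low * hot_seconds, high * hot_seconds)


# Hypothetical system whose hot restart takes 2 seconds.
warm = restart_time_range("warm", 2.0)   # (4.0, 6.0)
cold = restart_time_range("cold", 2.0)   # (20.0, 200.0)
```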
Fault-management software
A well-designed set of system services can alleviate much of the work an application needs to perform in order to be highly available. A stable platform for system implementation makes high degrees of availability achievable. High availability requires careful design at all levels of system implementation.
Software infrastructure for high availability begins with the OS (see Figure 4). To allow the basic hot swap of components in a CompactPCI (CPCI) system, the OS must support services that allow the dynamic loading and unloading of device drivers and the reallocation of system resources to those drivers. This is the most fundamental service needed to enable basic repair and reconfiguration of the system.
CPCI hot swap also provides for a more automated version of hot swap known as full hot swap. The purpose of full hot swap is to improve the operator interface for the system. The budget for high availability requires that the operator interface be as resilient to errors as possible. Full hot swap uses a signal (called ENUM) to alert the system to the impending removal or insertion of a board, allowing the system resources to be automatically reconfigured. Full hot swap supports the use of nonhardened drivers for simpler system models.
Higher levels of availability are achieved through hardened drivers. Hardened drivers understand that the underlying hardware may be malfunctioning or not present altogether. Hardened drivers allow applications to apply higher restart models by implementing off-line and standby modes of operation. These models provide diagnostic information for the fault-management system. Drivers are the first layer of fault detection for a highly available system.
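The idea of a hardened driver can be sketched in ordinary application code. The example below is not a real device driver and does not use any CPCI interface; `HardenedDriver`, `Mode`, and the `read_register` callback are hypothetical names, and an `OSError` stands in for whatever a real platform reports when a board is absent or misbehaving.

```python
from enum import Enum
from typing import Callable, Optional


class Mode(Enum):
    ONLINE = "online"
    OFFLINE = "offline"
    STANDBY = "standby"


class HardenedDriver:
    """Wraps hardware access so a missing board never crashes the caller."""

    def __init__(self, read_register: Callable[[], int]) -> None:
        self._read_register = read_register  # hypothetical hardware access
        self.mode = Mode.ONLINE
        self.last_error: Optional[str] = None

    def read(self) -> Optional[int]:
        if self.mode is not Mode.ONLINE:
            return None  # off-line and standby devices carry no traffic
        try:
            return self._read_register()
        except OSError as exc:
            # Hardware missing or malfunctioning: go off-line and keep the
            # reason as diagnostic information for the fault manager.
            self.mode = Mode.OFFLINE
            self.last_error = str(exc)
            return None
```

The design point is that the failure is converted into a mode change plus a diagnostic record, which the fault-management layers above can act on.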
Adding topology management
Managing the system topology is the next layer of infrastructure needed in a highly available system. Managing the topology of a system involves electrically isolating components of a system without having service personnel physically on site. The ability to manage the topology of the system allows unattended reconfiguration of that system. Topology management also contributes to the stability of the system.
CPCI offers this capability through the high-availability system model. Systems implementing the hot-swap controller can order individual boards to be powered off or reset under software control.
Topology management in an open-architecture system demands flexibility. From the perspective of an equipment provider for OEM communication systems, the final application payload is often added by the next value-add integrator. The topology of the system will be dynamic and the topology management must be equally flexible. Topology management must be driven by a database that users can access. A well-designed topology manager hides the underlying system infrastructure from the user.
Automated fault, event management
All of the software described so far provides the basic ability to control the fundamental system components. This software infrastructure is comparable to a similar hardware infrastructure that supports repair and reconfiguration of the system. To achieve high availability (that is, 99.999% uptime), the fault-management process must be fully automated. Relying on an operator to manage a system will take too long.
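The arithmetic behind the five-nines figure shows why an operator cannot be in the loop. A short sketch, assuming a 365-day year; the function name is illustrative only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def annual_downtime_minutes(availability_percent: float) -> float:
    """Convert an availability percentage into allowed downtime per year."""
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


budget = annual_downtime_minutes(99.999)  # about 5.26 minutes per year
```

With a total budget of roughly five minutes a year for every cause of outage combined, a single incident that waits for a person to notice and respond could consume the entire allowance.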
A fully automated management system begins with an event/fault manager. Events such as upgrades in hardware or software and events driven by failures in hardware or software are essentially the same from a management standpoint. They both trigger actions necessary to change the system operation and return to a stable, service-providing state.
Since the topology of the system is flexible, event management must also be flexible. The event manager must be able to receive direction from procedures defined by the system integrator. A well-designed event manager provides the user with an interface that makes defining these procedures simple.
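One way to picture such an event manager is as a table of integrator-supplied procedures keyed by event type. This is a minimal sketch under that assumption; `EventManager`, `register`, `handle`, and the event names are all hypothetical and do not correspond to any particular product.

```python
from typing import Callable, Dict, List

Procedure = Callable[[str], str]  # takes a component name, returns the action


class EventManager:
    def __init__(self) -> None:
        self._procedures: Dict[str, Procedure] = {}
        self.log: List[str] = []

    def register(self, event_type: str, procedure: Procedure) -> None:
        """Let the system integrator define what happens for an event type."""
        self._procedures[event_type] = procedure

    def handle(self, event_type: str, component: str) -> str:
        procedure = self._procedures.get(event_type)
        if procedure is None:
            result = f"escalate: no procedure for {event_type} on {component}"
        else:
            result = procedure(component)
        self.log.append(result)
        return result


def switch_to_standby(component: str) -> str:
    return f"switch service from {component} to its standby"


manager = EventManager()
# Upgrades and failures trigger the same kind of action, so they can share
# a procedure.
manager.register("hardware_fault", switch_to_standby)
manager.register("software_upgrade", switch_to_standby)
```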
Higher-level services
A distributed processing environment (DPE) coupled with checkpointing and heartbeat services facilitates active/standby operation. These services enable the application to implement a higher-level restart model. Checkpointing provides the application with a means to save state. The heartbeat service keeps the active and standby in synchronization and serves as a means to detect a failure of the checkpointing process.
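The interplay of checkpointing and heartbeats can be sketched from the standby's point of view. This is a simplified illustration, not a description of any specific DPE: the `Standby` class and its methods are hypothetical, timestamps are passed in explicitly rather than read from a clock, and a missed heartbeat is treated as the only failure signal.

```python
from typing import Any, Dict, Optional


class Standby:
    def __init__(self, heartbeat_timeout: float) -> None:
        self.heartbeat_timeout = heartbeat_timeout
        self.last_heartbeat: Optional[float] = None
        self.checkpoint: Dict[str, Any] = {}
        self.active = False

    def on_checkpoint(self, state: Dict[str, Any]) -> None:
        # Keep a copy so later changes on the active side do not leak in.
        self.checkpoint = dict(state)

    def on_heartbeat(self, now: float) -> None:
        self.last_heartbeat = now

    def poll(self, now: float) -> bool:
        """Take over if the active component has been silent too long."""
        if self.active or self.last_heartbeat is None:
            return self.active
        if now - self.last_heartbeat > self.heartbeat_timeout:
            # Promote to active; service resumes from the last checkpoint.
            self.active = True
        return self.active
```

A standby that has received the checkpoint `{"calls": 3}` and a heartbeat at time 0.0 would, with a 1-second timeout, remain passive when polled at 0.5 and take over when polled at 2.0, resuming from the saved state.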
The DPE abstracts the system configuration from the application. Working with the topology manager, the user application does not need to know the actual configuration of the system, such as where the standby device is.
The event manager works with the topology manager and the DPE to hide system events from the application layer. Changes in topology due to faults can be handled by logically mapping transactions to other devices when one is changed or failed.
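The logical mapping can be illustrated with a small lookup table that applications query by name. This is a sketch only; `TopologyMap`, the logical name `call-control`, and the slot identifiers are invented for the example.

```python
class TopologyMap:
    """Maps logical service names to whichever device currently serves them."""

    def __init__(self, mapping: dict[str, str], standbys: dict[str, str]) -> None:
        self._mapping = dict(mapping)    # logical name -> active device
        self._standbys = dict(standbys)  # active device -> standby device

    def resolve(self, logical_name: str) -> str:
        return self._mapping[logical_name]

    def device_failed(self, device: str) -> None:
        standby = self._standbys.pop(device, None)
        if standby is None:
            raise LookupError(f"no standby configured for {device}")
        # Repoint every logical name that was served by the failed device.
        for name, current in list(self._mapping.items()):
            if current == device:
                self._mapping[name] = standby


topology = TopologyMap({"call-control": "slot-3"}, {"slot-3": "slot-7"})
```

The application keeps asking for `call-control` before and after the failure of `slot-3`; only the answer returned by `resolve` changes.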
Another important service that increases the availability of a system is diagnostics and diagnostic management. Diagnostics are used by the event manager to determine which component has failed. A common requirement for telecom platforms is 95% first-time accuracy in identifying the failed component.
Since devices in an active/standby system have state, a diagnostic manager is used to execute the correct diagnostic for the correct state of the device. The diagnostic manager works with the topology manager and event manager to determine if the device is capable of operating in the desired state.
An application manager allows application software to be managed like a fault zone. In COTS, software is often deployed without the extensive testing that fault-tolerant software used to receive. Application management allows the system to fail and restart the software just as failed hardware would be treated.
All of the infrastructure described still does not result in a five-nines available system. The previous availability budget for a five-nines system allocates 1 minute per year for system upgrades. The ability to do online software upgrades is essential to reach this goal. Online software upgrading requires an active/standby application model that features all of the previous infrastructure components.
Since very few systems are truly stand-alone, the final piece to achieving five nines is an interface to a management network. Fault management generally gives the best results when approached hierarchically. When a component is unable to resolve a particular situation, the fault-containment strategy becomes stronger if the problem is escalated to a higher-level entity.
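Hierarchical escalation can be sketched as a chain of handlers tried from the lowest level upward. The level names and the faults each level can resolve are assumptions made up for this example; a real hierarchy would be defined by the platform and its management network.

```python
from typing import Callable, List, Optional, Tuple

Handler = Callable[[str], bool]  # returns True if the fault was resolved


def escalate(fault: str, levels: List[Tuple[str, Handler]]) -> Optional[str]:
    """Offer the fault to each level in order, lowest first.

    Returns the name of the level that resolved it, or None if none could.
    """
    for name, handler in levels:
        if handler(fault):
            return name
    return None


# Hypothetical three-level hierarchy.
levels: List[Tuple[str, Handler]] = [
    ("driver", lambda fault: fault == "transient-bus-error"),
    ("event-manager", lambda fault: fault == "board-hang"),
    ("management-network", lambda fault: True),
]
```

Here a `board-hang` is contained by the event manager, while anything the local levels cannot resolve ends up with the management network.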
Third-party software
High-availability technology is quickly moving out of the proprietary platforms of the major telecom equipment providers and into open-architecture platforms. Equally open software for high availability is also becoming more common.
One source of this software is vendors creating highly available OSes. Many vendors are working on such efforts but none have announced any products, though this picture will be very different in 2001. Some of today's OSes claim high availability but are limited to clustered (2N) applications. Another option is a combination of OSes supporting basic services and third-party fault-management software that works with a variety of OSes. While some products may implement most of the required components, the user still must develop the procedures and policies for the management of the target platform.
The proper software infrastructure is as important as the proper hardware infrastructure for high availability. Availability only comes after many well-crafted components are properly integrated, and it is equally important to understand just what is required of a system in terms of availability.
Higher levels of availability come with a price. The correct system architecture, restart model, and software infrastructure can result in a system that is optimal in price and performance. It is crucial that system integrators and platform providers work together to address all of the required elements to achieve the desired availability. A platform provider that is experienced in making these choices can be an invaluable partner.