CommsDesign

21 November 2009

Feature

High-Availability Systems Made Easy: Part 2


By Chuck Hill

In part one of this article, we looked at system architectures for high availability -- but that is only half the battle. The designer must also consider the software, and especially the software restart model, when developing high-availability architectures.

A true high-availability system is not complete unless all factors capable of causing system outages are considered. And while the mechanisms for modeling and predicting hardware behavior are fairly well understood, some other factors are harder to model and more difficult to predict. The data for these components are typically tracked on an historical basis, since systems do not normally achieve high availability at first deployment. Historical feedback and system refinements are critical to reaching high levels of availability.

Historically, software has been the second-largest factor in system outages, but it does not usually receive the attention that hardware does from a fault-management perspective. Software is also changed and upgraded more frequently, making it more difficult to predict the long-term impact on system stability. With commercial off-the-shelf (COTS) systems depending on software for fault detection and fault management, software is more critical than ever before.

The chart shown in Figure 1 illustrates another important consideration. Hardware and software failures account for approximately two-thirds of the system's outage budget. The remaining one-third is composed of factors that are even more difficult to control. System instability due to hardware and software upgrades is a significant factor. Outages due to operator errors also occur, as do outages due to natural disasters and vandalism, and these are statistically significant.

Cold, warm, hot


Software has a key role to play in the fault management of a system. The duration of outages, whether caused by hardware or software, is determined by the system's ability to restart service. The strategies employed in the fault-management process vary depending on whether the system uses a hot-restart, a warm-restart, or a cold-restart model. The restart model is affected by the amount of information available to the system at the time of an event: the more information available, the faster the restart.

A hot-restart system has the fastest recovery time, but is the most complex to implement. In a hot-restart model, the application saves state information about the current activity of the system. That information is given to the standby component so that it is ready to take over quickly. Applications must be designed to restart using this state information for the system to work.

A hot-restart system also requires that a standby component is designated prior to a fault-management event. In clustered systems (2N) this is straightforward since there is a one-to-one correspondence between components and their standbys (see Figure 2). In N + 1 systems, hot restart requires the standby device to save the states of multiple components. The standby must have the extra capacity to do this. Otherwise, a warm-restart model must be used.

A warm restart is similar to the hot-restart model. In a warm-restart model, the applications save state information about the current activity of the system (see Figure 3). The standby component is not designated until the fault-management cycle is in progress. The standby component is then configured with the necessary application and state information. This adds time to the restart process, but can reduce costs associated with the standby components. Warm restart is also easier to implement in systems where the standby devices are not identical to the active devices.

A cold restart is the least complex to implement, but requires the most time. A cold restart implies that the starting place for the standby element is its initialization point. A cold restart is used when no information is available about the state of the failing component. The last known state is therefore initialization.

A cold-restart system can be implemented with few or no changes to the applications of a system. The high-availability-specific software components can be relegated to OS software and services. But the price for this simplicity is much longer restart times and loss of any current activity in the system.
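The difference between the three restart models can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names, the `calls_in_progress` field, and the step descriptions are hypothetical and do not come from any particular platform.

```python
from dataclasses import dataclass, field
from enum import Enum


class RestartModel(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


INITIAL_STATE = {"calls_in_progress": 0}  # hypothetical initialization state


@dataclass
class Standby:
    """A standby component; all field names are illustrative."""
    state: dict = field(default_factory=lambda: dict(INITIAL_STATE))
    configured: bool = False


def take_over(model, standby, saved_state):
    """Return the steps the standby performs to resume service."""
    steps = []
    if model is RestartModel.HOT:
        # Standby was designated in advance and already mirrors the state.
        steps.append("resume from mirrored state")
    elif model is RestartModel.WARM:
        # Standby is chosen during the fault cycle, then configured.
        standby.state = dict(saved_state)
        standby.configured = True
        steps += ["load application", "load saved state", "resume"]
    else:
        # Cold: no state is known, so the starting point is initialization.
        standby.state = dict(INITIAL_STATE)
        standby.configured = True
        steps += ["load application", "initialize", "start service"]
    return steps
```

The hot path has the fewest steps because the work was done before the fault. The cold path discards any activity that was in progress.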

The times associated with the different restart models vary depending on the implementation of the system and application software. In relative terms, if the hot-restart model is 1X, the warm-restart model can be 2 to 3X and the cold-restart model is approximately 10 to 100X. Restart times will be the lowest in systems where the system software and applications are designed to support high availability.

Fault-management software


A well-designed set of system services can alleviate much of the work an application needs to perform in order to be highly available. A stable platform for system implementation makes high degrees of availability achievable. High availability requires careful design at all levels of system implementation.

Software infrastructure for high availability begins with the OS (see Figure 4). To allow the basic hot swap of components in a CompactPCI (CPCI) system, the OS must support services that allow the dynamic loading and unloading of device drivers and the reallocation of system resources to those drivers. This is the most fundamental service needed to enable basic repair and reconfiguration of the system.

CPCI hot swap also provides for a more automated version of hot swap known as full hot swap. The purpose of full hot swap is to improve the operator interface for the system. The budget for high availability requires that the operator interface be as resilient to errors as possible. Full hot swap uses a signal (called ENUM) to alert the system to the impending removal or insertion of a board, allowing the system resources to be automatically reconfigured. Full hot swap supports the use of nonhardened drivers for simpler system models.
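As a rough illustration of the OS-side behavior, the sketch below models a notification handler that loads or unloads a driver when a board is about to be inserted or removed. It is a toy model, not a real CompactPCI or OS interface: `HotSwapManager`, `on_enum`, and the driver naming scheme are all assumptions made for the example.

```python
class HotSwapManager:
    """Toy model of OS-side full hot swap; not a real CompactPCI API."""

    def __init__(self):
        self.drivers = {}  # slot number -> name of the loaded driver

    def on_enum(self, slot, inserting, board_type=None):
        """Handle an ENUM-style notification for one slot."""
        if inserting:
            # Allocate resources and bind a driver to the new board.
            self.drivers[slot] = f"{board_type}_driver"
            return f"loaded {self.drivers[slot]} for slot {slot}"
        # Impending removal: unload the driver before the board leaves.
        driver = self.drivers.pop(slot, None)
        if driver is None:
            return f"slot {slot} had no driver"
        return f"unloaded {driver} from slot {slot}"
```

Because the driver is unloaded before the board is removed, the driver itself never has to cope with hardware that has disappeared, which is why nonhardened drivers can be used in this model.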

Higher levels of availability are achieved through hardened drivers. Hardened drivers understand that the underlying hardware may be malfunctioning or not present altogether. Hardened drivers allow applications to apply higher restart models by implementing off-line and standby modes of operation. These models provide diagnostic information for the fault-management system. Drivers are the first layer of fault detection for a highly available system.
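A hardened driver might look something like the following sketch. The mode names, the `read` method, and the way a missing board is modeled (a `None` hardware handle) are assumptions for illustration only.

```python
from enum import Enum


class Mode(Enum):
    ACTIVE = 1
    STANDBY = 2
    OFFLINE = 3


class HardenedDriver:
    """Illustrative driver that never assumes its hardware is healthy."""

    def __init__(self, hardware):
        self.hardware = hardware  # None models a board that is not present
        self.mode = Mode.STANDBY
        self.fault_log = []       # diagnostic information for fault management

    def read(self):
        """Return (ok, value) instead of failing when the hardware misbehaves."""
        if self.mode is not Mode.ACTIVE:
            return False, None
        try:
            if self.hardware is None:
                raise IOError("board not present")
            return True, self.hardware()
        except Exception as exc:
            # Record the fault and take the device off-line.
            self.fault_log.append(str(exc))
            self.mode = Mode.OFFLINE
            return False, None
```

The fault log is what makes the driver the first layer of fault detection: the fault-management system can read it to learn what went wrong and when.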

Adding topology management


Managing the system topology is the next layer of infrastructure needed in a highly available system. Managing the topology of a system involves electrically isolating components of a system without having service personnel physically on site. The ability to manage the topology of the system allows unattended reconfiguration of that system. Topology management also contributes to the stability of the system.

CPCI offers this capability through the high-availability system model. Systems implementing the hot-swap controller can order individual boards to be powered off or reset under software control.

Topology management in an open-architecture system demands flexibility. From the perspective of an equipment provider for OEM communication systems, the final application payload is often added by the next value-add integrator. The topology of the system will be dynamic and the topology management must be equally flexible. Topology management must be driven by a database that users can access. A well-designed topology manager hides the underlying system infrastructure from the user.
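A database-driven topology manager could be sketched as follows. The record layout (`board`, `power`, `resets`) and the method names are hypothetical; a real system would drive the hot-swap controller rather than update a dictionary.

```python
class TopologyManager:
    """Illustrative topology manager backed by a user-accessible database."""

    def __init__(self, database):
        # database: slot name -> {"board": str, "power": "on" or "off"}
        self.db = database

    def power_off(self, slot):
        """Electrically isolate a board under software control."""
        self.db[slot]["power"] = "off"

    def reset(self, slot):
        """Model a reset as a cycle that ends with the board powered on."""
        self.db[slot]["power"] = "on"
        self.db[slot]["resets"] = self.db[slot].get("resets", 0) + 1

    def in_service(self):
        """Return the slots that are currently powered, in sorted order."""
        return sorted(s for s, rec in self.db.items() if rec["power"] == "on")
```

Because the integrator edits the database rather than the manager, payload boards added later need no changes to the management code.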

Automated fault, event management


All of the software described so far provides the basic ability to control the fundamental system components. This software infrastructure is comparable to a similar hardware infrastructure that supports repair and reconfiguration of the system. To achieve high availability (that is, 99.999% uptime), the fault-management process must be fully automated. Relying on an operator to manage a system will take too long.
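To see why an operator is too slow, it helps to work out what 99.999% uptime allows. The short calculation below assumes a 365-day year; the function name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability):
    """Return the allowed downtime per year for a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


budget = downtime_budget_minutes(0.99999)
print(f"{budget:.2f} minutes per year")  # prints "5.26 minutes per year"
```

A budget of roughly five minutes per year, shared among all causes of outage, leaves no room for a person to notice an alarm, diagnose the fault, and act.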

A fully automated management system begins with an event/fault manager. Events such as upgrades in hardware or software and events driven by failures in hardware or software are essentially the same from a management standpoint. They both trigger actions necessary to change the system operation and return to a stable, service-providing state.

Since the topology of the system is flexible, event management must also be flexible. The event manager must be able to receive direction from procedures defined by the system integrator. A well-designed event manager provides the user with an interface that makes defining these procedures simple.
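One way to picture integrator-defined procedures is a table that maps event types to functions, as in this sketch. The event names and the steps each procedure returns are made up for illustration.

```python
class EventManager:
    """Illustrative event manager driven by integrator-defined procedures."""

    def __init__(self):
        self.procedures = {}  # event type -> procedure supplied by the integrator

    def define(self, event_type, procedure):
        self.procedures[event_type] = procedure

    def handle(self, event_type, component):
        """Run the procedure for an event, or escalate if none is defined."""
        procedure = self.procedures.get(event_type)
        if procedure is None:
            return ["escalate to management network"]
        return procedure(component)


def replace_board(component):
    return [f"switch to standby for {component}",
            f"power off {component}",
            f"notify operator to replace {component}"]


def upgrade_software(component):
    return [f"switch to standby for {component}",
            f"install new software on {component}",
            f"switch back to {component}"]


manager = EventManager()
manager.define("hardware_failure", replace_board)
manager.define("software_upgrade", upgrade_software)
```

Note that the failure and the upgrade go through the same `handle` path, which reflects the point that both kinds of event are essentially the same from a management standpoint.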

Higher-level services


A distributed processing environment (DPE) coupled with checkpointing and heartbeat services facilitates active/standby operation. These services enable the application to implement a higher-level restart model. Checkpointing provides the application with a means to save state. The heartbeat service keeps the active and standby in synchronization and serves as a means to detect a failure of the checkpointing process.
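The interplay between checkpointing and heartbeats can be sketched as below. The design is an assumption: each heartbeat carries the sequence number of the last checkpoint the active side believes it sent, so the standby can notice both a silent active and a checkpoint that never arrived. Timestamps are passed in explicitly to keep the example deterministic.

```python
class CheckpointStore:
    """Standby-side copy of the application state (illustrative)."""

    def __init__(self):
        self.sequence = 0
        self.state = None

    def save(self, state):
        self.sequence += 1
        self.state = dict(state)


class HeartbeatMonitor:
    """Tracks heartbeats from the active component (illustrative)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_beat = None
        self.last_sequence = 0

    def beat(self, now, sequence):
        # Each heartbeat reports the active side's latest checkpoint number.
        self.last_beat = now
        self.last_sequence = sequence

    def active_failed(self, now):
        return self.last_beat is None or now - self.last_beat > self.timeout

    def checkpoint_stale(self, store):
        # The active has checkpointed more often than the standby received.
        return self.last_sequence > store.sequence
```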

The DPE abstracts the system configuration from the application. Working with the topology manager, the user application does not need to know the actual configuration of the system, such as where the standby device is.

The event manager works with the topology manager and the DPE to hide system events from the application layer. Changes in topology due to faults can be handled by logically mapping transactions to other devices when one is changed or failed.
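Logical mapping can be illustrated with a small table that the application consults by service name only. The class, the service name, and the device names here are hypothetical.

```python
class LogicalMap:
    """Maps logical service names to physical devices (illustrative)."""

    def __init__(self, active, standby):
        self.active = dict(active)    # logical name -> device now serving it
        self.standby = dict(standby)  # logical name -> device to fall back on

    def route(self, logical):
        """Return the device that should receive a transaction."""
        return self.active[logical]

    def fail(self, device):
        """Remap every logical name served by a failed device to its standby."""
        for logical, current in self.active.items():
            if current == device:
                self.active[logical] = self.standby[logical]
```

An application that always calls `route("call_control")` keeps working after `fail("board_a")` without knowing that the transaction now lands on a different board.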

Another important service that increases the availability of a system is diagnostics and diagnostic management. Diagnostics are used by the event manager to determine a failed component. A common requirement for telecom platforms is 95% first-time accuracy in identifying the failed component.

Since devices in an active/standby system have state, a diagnostic manager is used to execute the correct diagnostic for the correct state of the device. The diagnostic manager works with the topology manager and event manager to determine if the device is capable of operating in the desired state.
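A state-aware diagnostic manager could be as simple as a table keyed by device state, as in this sketch. Which tests are appropriate for which state is an assumption made for the example, as are the test names and the device record fields.

```python
def loopback_test(device):
    return device.get("link_ok", False)


def memory_test(device):
    return device.get("memory_ok", False)


# Assumed policy: an active device runs only the non-intrusive test.
DIAGNOSTICS = {
    "active": [loopback_test],
    "standby": [loopback_test, memory_test],
    "offline": [memory_test, loopback_test],
}


def run_diagnostics(device):
    """Run the diagnostics allowed for the device's state; return failures."""
    tests = DIAGNOSTICS[device["state"]]
    return [test.__name__ for test in tests if not test(device)]
```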

An application manager allows application software to be managed like a fault zone. In COTS, software is often deployed without the extensive testing that fault-tolerant software used to receive. Application management allows the system to fail and restart the software just as failed hardware would be treated.
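Treating an application as a fault zone might look like the sketch below, where the manager restarts failed software and gives up after a limit. The restart limit and the names are assumptions, not something the platform prescribes.

```python
class ApplicationManager:
    """Restarts a failed application much like a failed board (illustrative)."""

    def __init__(self, start, max_restarts=3):
        self.start = start                # callable that launches the application
        self.max_restarts = max_restarts  # assumed policy limit
        self.restarts = 0

    def report_failure(self):
        """Restart the application, or escalate once the limit is reached."""
        if self.restarts >= self.max_restarts:
            return "escalate"
        self.restarts += 1
        self.start()
        return "restarted"
```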

All of the infrastructure described still does not result in a five-nines available system. The previous availability budget for a five-nines system allocates 1 minute per year for system upgrades. The ability to do online software upgrades is essential to reach this goal. Online software upgrading requires an active/standby application model that features all of the previous infrastructure components.

Since very few systems are truly stand-alone, the final piece to achieving five nines is an interface to a management network. Fault management generally gives the best results when it is approached hierarchically. When a component is unable to resolve a particular situation, the fault-containment strategy becomes stronger if the problem is escalated to a higher-level entity.
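Hierarchical escalation can be sketched as an ordered list of levels, each of which either resolves a fault or passes it upward. The level names and the faults each level can handle are invented for illustration.

```python
def resolve(fault, levels):
    """Offer the fault to each level in order; return the one that resolves it."""
    for name, can_resolve in levels:
        if can_resolve(fault):
            return name
    return "unresolved"


# Hypothetical hierarchy, from the lowest level to the management network.
LEVELS = [
    ("driver", lambda fault: fault == "transient bus error"),
    ("event manager", lambda fault: fault in {"board failure", "application crash"}),
    ("management network", lambda fault: fault == "shelf power loss"),
]
```

For example, `resolve("board failure", LEVELS)` returns `"event manager"`, while a fault that no local level can handle reaches the management network.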

Third-party software


High-availability technology is quickly moving out of the proprietary platforms of the major telecom equipment providers and into open-architecture platforms. Equally open software for high availability is also becoming more common.

One source of this software is vendors creating highly available OSes. Many vendors are working on such efforts, but none has announced any products, though this picture will be very different in 2001. Some of today's OSes claim high availability but are limited to clustered (2N) applications. Another option is a combination of OSes supporting basic services and third-party fault-management software that works with a variety of OSes. While some products may implement most of the required components, the user still must develop the procedures and policies for the management of the target platform.

The proper software infrastructure is as important as the proper hardware infrastructure for high availability. Availability only comes after many well-crafted components are properly integrated, and it is equally important to understand just what is required of a system in terms of availability.

Higher levels of availability come with a price. The correct system architecture, restart model, and software infrastructure can result in a system that is optimal in price and performance. It is crucial that system integrators and platform providers work together to address all of the required elements to achieve the desired availability. A platform provider that is experienced in making these choices can be an invaluable partner.


About the Author

Chuck Hill is a system architect for Motorola Computer Group's Telecom Business Unit. He has six years' experience developing fault-tolerant and high-availability systems for Motorola. Chuck is also active in several PICMG committees, including the Hot Swap Subcommittee and the Redundant System Slot Subcommittee. He can be reached at chuck_hill@mcg.mot.com


Illustrations

Figure 1: Availability budget.
Figure 2: Hot standby model.
Figure 3: Warm restart in N+1.
Figure 4: Software infrastructure for high availability.




    All materials on this site Copyright © 2009 TechInsights, a Division of United Business Media LLC All rights reserved.