CommsDesign

12 March 2010

Feature

High-Availability Systems Made Easy: Part 1


By Chuck Hill

High availability isn't just a concept, it's a science. Part one of this two-part series explores the various practices available to help designers bypass established methods and achieve the vaunted "five-nines" of network uptime.

Hopefully, when everyday troubles such as power spikes and failures hit our lauded all-digital networks, the outcome will not be mass communication outages. In a world where all types of services -- voice, video, and data -- stake their claim on "24/7" availability, certain safeguards must be in place.

Yet the architectures for tomorrow's computing platforms must be as open and as flexible as today's desktop computers, highly reliable, and a fraction of the cost of traditional fault-tolerant computers.

Now, fault tolerance and high availability are not quite the same thing. Fault tolerance encompasses two properties: transactional reliability and high system availability. Many systems today use fault-tolerant computers when the application only needs availability. Transactional reliability is needed in banking applications or in billing records for a telecom network.

Availability is needed in file-server applications or in call-processing applications. Transactional reliability is not needed for any application where some loss of data is tolerable, or where data transfer is protected by a reliable end-to-end protocol such as TCP/IP.

Traditional implementations of fault-tolerant platforms often involve proprietary hardware and software. This can result in high costs and long design cycles -- two things that are not acceptable in emerging, competitive markets such as telecom.

In this article we'll examine hardware configuration issues for high availability. In the next installment, we'll consider software requirements.

What is high availability?


Availability is not a nebulous term. In fact, it can be expressed mathematically. Put simply, a highly available system is one that is usable when the customer needs it. A system can be highly available operating 8 a.m. to 5 p.m. if that is all the business demands. The remaining time can be used for scheduled maintenance and repair. Availability is defined as actual service divided by required service. The challenge for many of today's systems is to operate 24 hours a day, 365 days a year (sometimes referred to as 7x24, or 365x24).

Availability is often expressed in percentages. A 365x24 system with 99.9% availability has an average down time of 8.76 hours per year (about 526 minutes). A system limited to roughly five minutes of service outage per year must have 99.999% availability.
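
The conversion from an availability percentage to yearly down time can be sketched in a few lines of Python. The function name and the 365-day service year are assumptions made for illustration only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365x24 service year


def downtime_minutes_per_year(availability_percent):
    """Average yearly down time implied by an availability percentage."""
    unavailability = 1.0 - availability_percent / 100.0
    return unavailability * MINUTES_PER_YEAR


# Print the down-time budget for two-nines through five-nines.
for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.2f} min/yr")
```

Each additional nine cuts the yearly down-time budget by a factor of ten.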

Availability is calculated using statistical models for all of the system components, the simplest model for a component being binary. The component is either in or out of service. Availability can be calculated from failure rates, measured in mean time between failures (MTBF), and repair times, measured in mean time to repair (MTTR).

The average down-time contribution by any component is calculated by amortizing the MTTR time over the MTBF period. For example, if a component critical to the operation of the platform has an MTBF of 250,000 hrs and an MTTR of 1 hr, it contributes 2.1 min of unavailability to the system per year (60 min × 8,760 hrs/yr ÷ 250,000 hrs).
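
As a rough illustration, the amortization can be written as a small Python helper. The function name is hypothetical; the numbers are the ones used in the example above.

```python
HOURS_PER_YEAR = 8760


def annual_downtime_minutes(mtbf_hours, mttr_hours):
    """Amortize the repair time over the MTBF period, in minutes per year."""
    failures_per_year = HOURS_PER_YEAR / mtbf_hours  # expected failures per year
    return failures_per_year * mttr_hours * 60


# Component with an MTBF of 250,000 hrs and an MTTR of 1 hr.
print(f"{annual_downtime_minutes(250_000, 1):.1f} min/yr")
```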

Availability in the two-nines or three-nines range (99% to 99.9%) can be achieved by maximizing the reliability of components and minimizing repair times. To achieve higher reliability or to compensate for less-reliable components, redundancy is used. Having a backup for a component that fails keeps the system operating. Availability of redundant configurations is calculated based on the time taken to detect and switch over to the redundant component.

Fault management and fault coverage have become critical factors in system design. A more complex model for a system with redundancy requires the use of statistical methods to calculate availability.

Quick modeling techniques


A first-pass, quick, combinatorial calculation can be used to determine the relative unavailability contribution of each system component. This calculation ignores the failure-coverage factor. The result generally gives a high-side availability estimate and often points to areas that require more in-depth study. Detailed models can then be simplified by ignoring components with a negligible unavailability contribution.

The reliability block diagram (RBD) is one tool used for first-pass, quick, combinatorial availability calculations. A diagram of a two-board system's RBD is shown in Figure 1.

The system is only "up" if there is a path from the input to the output. Thus the diagram shows that the system is operational if either of the two boards is working. A system such as the one in Figure 1 can be analyzed by simple mathematics. Unavailability for one board is expressed as:

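$$\lambda = \text{failure rate} = \frac{1}{\text{MTBF}} = \frac{1}{250{,}000}$$

$$\mu = \text{repair rate} = \frac{1}{\text{MTTR}} = \frac{1}{4}$$

$$\text{Unavailability} = \frac{\lambda}{\lambda + \mu} \approx 1.5999 \times 10^{-5} \text{ for one board}$$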

The failure and repair rates are the inverses of the MTBF and MTTR, respectively. In a parallel system, as shown in the RBD, the unavailability factors multiply. The unavailability of the system is:

$$\text{Unavailability}_{\text{SYS}} = (1.5999 \times 10^{-5})^{2} \approx 2.560 \times 10^{-10}$$
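
The two-board calculation can be checked with a short Python sketch. The function names are illustrative, and the MTBF of 250,000 hrs and MTTR of 4 hrs are taken from the rates given above.

```python
def unavailability(mtbf_hours, mttr_hours):
    """Steady-state unavailability of one binary (up/down) component."""
    failure_rate = 1.0 / mtbf_hours  # lambda
    repair_rate = 1.0 / mttr_hours   # mu
    return failure_rate / (failure_rate + repair_rate)


def parallel_unavailability(unavailabilities):
    """A parallel RBD is down only when every block is down, so the
    individual unavailability factors multiply."""
    result = 1.0
    for u in unavailabilities:
        result *= u
    return result


board = unavailability(250_000, 4)
system = parallel_unavailability([board, board])
print(f"one board: {board:.4e}")
print(f"two boards in parallel: {system:.4e}")
```

Because this combinatorial model ignores fault coverage, the result is an optimistic bound rather than a prediction.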

RBDs provide a fast, easy-to-obtain result. However, they lack the power needed to model more complex system interactions.

Chain of tools


Continuous-time Markov chains are similar to finite-state diagrams, except that state changes are based on probabilistic elements. Markov chain analysis permits a more sophisticated model. The Markov chain analysis for the two-board system shown in Figure 2 includes a factor for fault coverage (the circled label 'c' in Figure 2).

Each oval in Figure 2 represents a state, and the directed arrows, known as arcs, represent state changes. In this model, the states are named in the form xBSy, where x represents the number of operating boards and y is U for system up and D for system down. The definition of each state is listed in Table 1.

The 1BSU and 1BSD states keep track of whether the system is up or down. If the system is down, these states indicate the number of boards that need to incur a repair time.

Transitions between states occur after exponentially distributed times. The respective rate (based on λ, μ, and c) is shown beside each arc. Transitions from state 2BSU to 1BSU, for example, are exponentially distributed with rate 2λc (the combined failure rate of the two boards times the probability that a board failure is covered by the standby board). Transitions from state 2BSU to 1BSD occur with rate 2λ(1 - c), which is the combined board failure rate times the probability that a board failure is not covered by the standby.

Thus there is a race between the arcs to determine whether a transition from state 2BSU will be to 1BSU or 1BSD. For high values of fault coverage (such as c = 0.9), most state transitions from 2BSU will be to 1BSU. However, the transitions are not deterministic -- the rates are parameters of exponential distributions. While in state 2BSU, on some occasions the transition will be to state 1BSD.

Similarly, since the repair rate μ is large compared to the board failure rate λ, most transitions from state 1BSU will be back to 2BSU. On rare occasions, where the second board fails before the first is repaired, the transition will be to 0BSD.

Given the Markov chain analysis for this system, system availability is measured by the fraction of the time that the chain is in any up state, in this case 2BSU or 1BSU. Alternatively, unavailability (1 minus the availability) is measured by the fraction of time that the chain is in a down state (1BSD or 0BSD).

Modeling tools, such as SHARPE, provide constructs to make these measurements [1]. A SHARPE model for the two-board system yields an unavailability of 3.200102E-06, which is five-nines, but not six-nines, compliant. The availability, therefore, is 1.0 - 3.200102E-06 = 0.9999968.
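
For readers without access to a modeling tool, the steady-state solution of a chain this small can be sketched directly in Python. This is not the SHARPE model itself. The failure arcs follow the rates described above, with MTBF = 250,000 hrs, MTTR = 4 hrs, and c = 0.9; the repair arcs out of 1BSD and 0BSD are assumptions, since they are not spelled out in the text. The result should therefore land near, but not necessarily exactly on, the SHARPE figure.

```python
def steady_state(states, rates):
    """Solve pi * Q = 0 with sum(pi) = 1 for a small continuous-time
    Markov chain. `rates` maps (source, destination) to a transition rate."""
    n = len(states)
    index = {name: i for i, name in enumerate(states)}

    # Generator matrix Q: off-diagonal entries are rates, rows sum to zero.
    q = [[0.0] * n for _ in range(n)]
    for (src, dst), rate in rates.items():
        q[index[src]][index[dst]] += rate
        q[index[src]][index[src]] -= rate

    # Balance equations are Q^T * pi = 0; replace the last with sum(pi) = 1.
    a = [[q[col][row] for col in range(n)] + [0.0] for row in range(n)]
    a[n - 1] = [1.0] * n + [1.0]

    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(n):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [x - factor * y for x, y in zip(a[row], a[col])]
    return {name: a[i][n] / a[i][i] for name, i in index.items()}


lam = 1.0 / 250_000  # board failure rate (per hour)
mu = 1.0 / 4         # board repair rate (per hour)
c = 0.9              # fault coverage

rates = {
    ("2BSU", "1BSU"): 2 * lam * c,        # covered board failure
    ("2BSU", "1BSD"): 2 * lam * (1 - c),  # uncovered board failure
    ("1BSU", "2BSU"): mu,                 # failed board repaired
    ("1BSU", "0BSD"): lam,                # second board fails before repair
    ("1BSD", "2BSU"): mu,                 # ASSUMED: repair restores full service
    ("0BSD", "1BSU"): mu,                 # ASSUMED: boards repaired one at a time
}

pi = steady_state(["2BSU", "1BSU", "1BSD", "0BSD"], rates)
unavailability = pi["1BSD"] + pi["0BSD"]  # fraction of time in a down state
print(f"unavailability = {unavailability:.6e}")
print(f"availability   = {1.0 - unavailability:.7f}")
```

With these assumed arcs the unavailability is dominated by the uncovered-failure state 1BSD, which is why fault coverage matters far more here than the chance of a double failure.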

Markov chain modeling works well for relatively small models, but larger models can result in a state space larger than is manageable by a human being. The model for the two-board system contains four states. Adding a third board to the system, for example, requires the addition of two more states (3BSU and 2BSD), for a total of six.

Consider what would happen if 10 or 12 boards were in the system. This problem has given rise to the use of even more advanced modeling tools.

Redundant configurations


Many techniques for redundancy can be employed in a system architecture. N redundant systems (2N, 3N, 5N, and so on) employ multiple, identical sets of resources isolated into separate fault zones. N redundancy can be employed on a component level (such as disks and power supplies), or in complete systems (such as redundant file servers). The simplest form of this is 2N redundancy.

N + m redundancy involves having individual spares for a group of resources (such as a spare port on an Ethernet hub). The simplest form of this approach is N + 1 redundancy. N + 1 redundancy can also be applied to individual system components or to clusters of complete systems.

The differences in the redundancy schemes are subtle but crucial to system design. 2N redundant systems have a duplicate for every critical resource in the system. The standby resources are kept up-to-date with the activities of the active resource. When a failure occurs, the entire active domain is taken out of service and the standby takes over.

The advantages of 2N redundancy include simple fault management and fast switch-over times. The disadvantages are the cost to duplicate every resource and difficulties with connectivity in systems with a large number of I/O connections.

An example of this situation is in telephony, where a system may employ many T1/E1 connections. Duplicating the line interface boards is expensive. Multiplexing T1 lines or asking the customer to lease a lot of spare lines is also expensive. For I/O intensive applications, N + 1 redundancy can be more cost-effective. With N + 1 redundancy, the spare cannot be fully configured because the exact nature of its task is not known until one of the active devices fails. This complicates fault management and increases switch-over times.

Management process


Fault management differs significantly between N redundant systems and N + 1 redundant systems. A fault-management cycle can be defined in phases: detection, location, isolation, recovery, reporting, repair, and reintegration. These processes are defined below in general terms, and later discussed in the context of 2N and N + 1 fault management. These definitions apply equally to the hardware and software components of a system.

  • Detection is the process of discovering that an error exists. Detection time is measured from the moment a failure causes a loss of service until the system becomes aware of it. Error detection is the responsibility of every hardware and software component in the system. To meet availability goals such as 99.999%, adequate error detection must be designed in, or a component may not be suitable for use in high-availability systems.

  • Location is the process of narrowing down the failure to the defective component. This process depends greatly on the definition of a fault zone. At this stage of the process, it is not necessary to locate an error to a region smaller than what will be isolated.

  • Isolation takes the defective portion of the system out of service. The region that is isolated must be bounded at a point where it can be removed from all interaction with the system.

  • Recovery is the process of reassigning the necessary resources to restore the system to an operating state. Recovery also requires restoring any portions of the system that were adversely affected by the failing component. Recovery is the final step in the process that contributes to outage time. Once the system is providing complete service again, the remainder of the process does not directly contribute to outage time.

  • Reporting is the process that notifies the outside world that an event has taken place. This is the first step in the repair process. The repair process is indirectly related to availability. In systems employing redundancy, a statistical possibility exists that a second failure can occur in the component covering for this failure. This would result in a complete system outage. While this probability is low, the severity is high enough to make this a factor in the availability equation. It is important, even in redundant systems, to keep repair times low.

  • Repair is the replacement of the defective component. This phase is generally designated for the operator-assisted (human) portion of the process. The repair process is broken into these phases for a couple of reasons. First, the repair step is usually the most time-consuming portion of the process. Second, this is a point in the process where mistakes can account for system outages.

  • Finally, the repaired component is reintegrated. Once the defective hardware or software component has been replaced, it is brought back into service either as a new standby component or sharing some of the system load.

These distinctions are somewhat arbitrary, but they provide a good reference that can be used to illustrate differences in system architectures.
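
One way to make the cycle concrete is to model the phases as an ordered enumeration and add up only those that contribute directly to outage time (detection through recovery). The sketch below is illustrative: the type and function names are hypothetical, and the durations are invented placeholder values, not measurements.

```python
from enum import IntEnum


class Phase(IntEnum):
    """Fault-management phases, in the order they occur."""
    DETECTION = 1
    LOCATION = 2
    ISOLATION = 3
    RECOVERY = 4
    REPORTING = 5
    REPAIR = 6
    REINTEGRATION = 7


def contributes_to_outage(phase):
    """Recovery is the final phase that adds directly to outage time."""
    return phase <= Phase.RECOVERY


def outage_ms(durations_ms):
    """Sum the durations of the phases that count toward the outage."""
    return sum(t for phase, t in durations_ms.items()
               if contributes_to_outage(phase))


# Hypothetical durations in milliseconds for a single fault.
example = {
    Phase.DETECTION: 500,
    Phase.LOCATION: 200,
    Phase.ISOLATION: 100,
    Phase.RECOVERY: 1200,
    Phase.REPORTING: 50,
    Phase.REPAIR: 3_600_000,
    Phase.REINTEGRATION: 30_000,
}
print(outage_ms(example), "ms of direct outage")
```

The repair phase is far longer than the others in this made-up example, yet it affects availability only indirectly, through the risk of a second failure while the system runs without a spare.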

Fault management in clusters


In a 2N, or clustered, system, which has highly encapsulated fault zones, the fault management cycle is somewhat simplified:

  • Detection is crucial. The time that a system is malfunctioning without being detected is considered a direct outage. The ability of a system to detect all possible failures is measured in its fault coverage. Anything not covered is assigned a probability and factored in as a severe outage.

  • Location is implied. Any fault detected is usually detected within the node itself (assuming good fault coverage), so the location is known.

  • Isolation and recovery are essentially the same step. All activity is moved off the failing node to its standby, thus isolating the defective node and recovering the system at the same time.

Clustered systems essentially move from detection to recovery. Location and isolation are inherent in the architecture of the system. The complexity of the recovery process is always dependent on the specific application running on the system.
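
The shortened cycle can be sketched as a minimal active/standby pair. The class, method, and node names below are hypothetical, and the sketch deliberately leaves out the application-specific state transfer that real recovery requires.

```python
class TwoNodeCluster:
    """Minimal 2N pair: one active node, one standby node."""

    def __init__(self, active="node-A", standby="node-B"):
        self.active = active
        self.standby = standby
        self.out_of_service = set()

    def on_fault(self, node):
        """Handle a fault detected on `node`; return the node now in service.

        Location is implied (the fault is reported by the node itself), and
        isolation and recovery collapse into a single switch-over.
        """
        if node == self.active:
            if self.standby is None:
                raise RuntimeError("no standby available: system outage")
            self.out_of_service.add(node)
            self.active, self.standby = self.standby, None
        elif node == self.standby:
            self.out_of_service.add(node)
            self.standby = None
        return self.active

    def reintegrate(self, node):
        """Bring a repaired node back as the new standby."""
        self.out_of_service.discard(node)
        if self.standby is None:
            self.standby = node


cluster = TwoNodeCluster()
print(cluster.on_fault("node-A"))  # service moves to the former standby
cluster.reintegrate("node-A")      # repaired node returns as the standby
```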

The remainder of the process proceeds off-line. The repair process can be as simple as replacing the entire node, or the technician may choose to further locate the failed component within the node. This diagnostic is done in an off-line system with full resources available to the repair process. A node-based system is relatively immune to mistakes made during this process.

Finely grained fault management


In systems where 2N configurations are not economical, devices are spared on an N + 1 arrangement. This requires a more finely grained fault management with a more complicated process cycle.

In these systems, fault detection is always the same. It needs to be done by every resource in the system, and as quickly as possible. Location is done with an on-line diagnostic. This diagnostic should not interfere with the operating portion of the system. The goal is to keep as much of the system operating as possible to avoid total outages.

Locating a failure in an active system is complicated. Failures cannot always be precisely located. A typical metric for on-line fault identification is 95% fault location accuracy.

Isolation is critical. To spare on more finely grained boundaries, the system must have an infrastructure that permits isolation of individual field-replaceable units (FRUs). In an N + 1 system, failed components must exist benignly while system activity continues.

Recovery is more complicated for two reasons. With nodes, a clean boundary can be placed around a node that hides the complexities of the system. An N + 1 system has a hierarchy of component dependencies, as shown in Figure 3. When a component is determined to be bad, any other system components depending on this resource must also be recovered. This aspect of topology management becomes part of the critical path to restoring service.
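
Finding the components that must be recovered along with a failed resource amounts to walking the dependency hierarchy. The sketch below assumes the hierarchy is available as a simple mapping from each component to the resources it depends on; the component names are invented for illustration and are not taken from Figure 3.

```python
from collections import deque


def components_to_recover(depends_on, failed):
    """Return every component that depends, directly or indirectly,
    on the failed component (breadth-first order)."""
    # Invert the mapping: resource -> components that depend on it.
    dependents = {}
    for component, resources in depends_on.items():
        for resource in resources:
            dependents.setdefault(resource, []).append(component)

    affected = []
    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected


# Hypothetical topology: two line cards sit behind a bus bridge.
topology = {
    "bus-bridge": ["cpu-board"],
    "disk": ["cpu-board"],
    "line-card-1": ["bus-bridge"],
    "line-card-2": ["bus-bridge"],
}
print(components_to_recover(topology, "bus-bridge"))
```

Because this traversal sits on the critical path to restoring service, the topology data needs to be kept current and cheap to query.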

Another difficulty of fault recovery in an N + 1 system is due to incomplete encapsulation of faults. A failing device may push other devices into error states. After a defective device has been identified, all the other error conditions need to be corrected.

Reporting is application dependent. With more finely grained FRUs, a more detailed method of indication is needed. Reporting becomes crucial to the repair process. Since an operator or technician is repairing a portion of a live system, it is essential to properly identify the specific FRU to be replaced. Operator errors account for a significant loss of service in these types of systems. Clean design of a reporting mechanism can minimize these mistakes.

Repair in a live system requires some form of "hot replacement." The system must be designed to support this. Another consideration for repair of an active system is to ensure that FRUs do not mechanically interact. It is not good practice to have to remove one component to get to another.

Central services, sir?


The final phase of reintegration is again slightly more complicated because it is taking place in a live system. Resources that are providing service must now be employed to reintegrate the new component.

The fault management cycle in N+1 systems requires much more processing from the CPU responsible for managing the system. A CPU that is involved in system activity must also have reserve processing capacity for fault management.

Traditionally, highly available systems were the result of significant in-house development by a few companies that specialized in this technology. High availability was a core competency of the major telecommunication equipment providers. For this technology to become commonplace, the competency must become part of a diverse landscape of equipment and component suppliers.

One of the first groups to address the issues of increasing system availability was the peripheral component interconnect (PCI) Special Interest Group (SIG). PCI SIG developed a standard for "Hot Plug" of PCI form-factor components used in many servers and industrial applications. This ability to "Hot Plug" components allows the repair or reconfiguration of a system without removing it from service.

The Intelligent Platform Management Interface (IPMI) provides a standardized mechanism for detecting and managing system components. System management is an important part of the fault management process.

Another organization focused on higher availability systems is the PCI Industrial Computer Manufacturers' Group (PICMG). The specifications for the CompactPCI (CPCI) form factor include a specification for hot swapping and an instantiation of IPMI for system management. Another specification to make the central services for PCI redundant is also under way. CPCI offers a more rugged form factor and faster repair times than the PCI form factor.

Both PCI and CPCI form factors offer basic features to construct systems with high availability. With the basic infrastructure these form factors offer, constructing high-availability systems becomes an exercise in provisioning the systems with adequate redundancy and providing the necessary software for fault management. The software for fault management is as important as the hardware infrastructure. This software infrastructure is the subject of the second part of this article.


About the Author

Chuck Hill is a system architect for Motorola Computer Group's Telecom Business Unit. He has six years of experience developing fault-tolerant and high-availability systems for Motorola. Chuck is also active in several PICMG committees, including the Hot Swap Subcommittee and the Redundant System Slot Subcommittee. He can be reached at Chuck_hill@mcg.mot.com.


Illustrations

Figure 1: RBD for two-board systems.
Figure 2: Markov chain for a two-board system.
Figure 3: Example of a dependency tree.


Tables
Table 1: Various Board States Explained


References
  1. Sahner, R., Trivedi, K., and Puliafito, A., Performance and Reliability Analysis of Computer Systems, Kluwer Academic Publishers, 1996.




    All materials on this site Copyright © 2010 EE Times Group, a Division of United Business Media LLC All rights reserved.