09 February 2010



Tutorial on Analyzing High Reliability: Part 1

Part 1 of this series explores the fundamental elements needed to deliver 99.999% reliability in telecom and networking platforms.

By Jeffrey S. Pattavina, Harris Corporation
CommsDesign
Mar 10, 2004
 
It's hard to talk about a networking, telecom, or wireless box architecture today without mentioning reliability. With traditional telecommunication operators once again taking the lead in the industry, equipment vendors are being asked to deliver systems that provide 99.999% reliability and beyond.

But what does it really take to reach the coveted Five Nines plateau? And how can you prove that you made it there? In Part 1 of this two-part tutorial, we'll provide an overview of the fundamentals of reliability theory. In Part 2, we'll further the discussion by describing a systematic method for analyzing the reliability of redundant systems using matrix techniques.

Note: For the purposes of this paper, we define a system as a number of interconnected elements or subsystems. In some cases, if any single component fails, the whole system fails. This type of system is referred to as a series-connected system. In other cases, elements are placed in parallel-redundant fashion such that a single component failure does not cause a system failure. This is referred to as a redundant or parallel-connected system.

The system can further be classified as being maintained or non-maintained. These are also referred to as systems with repair and without repair respectively. For maintained systems the failed unit is eventually replaced or repaired. For non-maintained systems the failed unit is not repaired.

Reliability: The Basics
Before taking a deep dive into how to analyze the reliability of a system, designers must first understand the basic elements required to incorporate reliability into a system design. Overall there are ten elements that designers must consider. These include:

  1. Failure Density
  2. Reliability Function
  3. Failure Function
  4. Conditional Failure Density
  5. Bathtub Curve
  6. Accelerated Stress Testing
  7. Constant Failure Rate
  8. Series-Connected System
  9. Mean Time Between Failures
  10. Failures in Time

Let's look at each in more detail starting with failure density and the reliability function.

To understand failure density, let t = 0 be the time a unit is put into service and let Tf represent the time to failure. Tf is a random variable and is referred to as the lifetime. The unconditional probability of a failure per unit time is called the failure density f(t). (The term hazard rate properly refers to the conditional failure rate h(t) defined later, not to f(t).) The probability of failing in any given time interval is found by integrating the failure density over the time interval as shown in Equation 1:

$$P(t_1 < T_f \le t_2) = \int_{t_1}^{t_2} f(t)\,dt \qquad (1)$$

The reliability function R(t) of a system at time t is the cumulative probability that the system has not failed from 0 to t. This can be represented as:

$$R(t) = P(T_f > t) = 1 - \int_0^t f(\tau)\,d\tau = \int_t^\infty f(\tau)\,d\tau \qquad (2)$$

Differentiating R(t) with respect to t gives the following relationship between R(t) and f(t):

$$f(t) = -\frac{dR(t)}{dt} \qquad (3)$$

Failure Function and Conditional Failure Density
The failure function F(t) of a system at time t is the cumulative probability that the system has failed at some point from 0 to t. The failure function and reliability function are related as follows:

$$F(t) = P(T_f \le t) = \int_0^t f(\tau)\,d\tau = 1 - R(t) \qquad (4)$$

The conditional probability per unit time of a system failing during the interval from t to t + Δt, given that it has not failed by time t, is given by the conditional failure density h(t), which is also referred to as the failure rate [1]. In terms of the quantities already defined, h(t) = f(t)/R(t).

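The failure density f(t) and failure rate h(t) can be used to find the probability of a system failing in a given small time interval Δt as follows:

$$P(t < T_f \le t + \Delta t) \approx f(t)\,\Delta t, \qquad P(t < T_f \le t + \Delta t \mid T_f > t) \approx h(t)\,\Delta t \qquad (5)$$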

The difference between the conditional and unconditional probabilities can be illustrated by a simple example. The unconditional probability that a man dies between ages 99 and 100 is very small, since the probability that he dies before age 99 is large. In contrast, the conditional probability that a man dies between ages 99 and 100, given that he is alive at age 99, is large.
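The age example can be made concrete with a small numerical sketch. The Weibull lifetime model and its parameters (scale of 80 years, shape of 8) are assumptions chosen only so that the failure rate rises with age; they are not real mortality data, and the function names are hypothetical.

```python
import math

def weibull_reliability(t, scale, shape):
    """R(t) = exp(-(t/scale)**shape): probability of surviving past time t."""
    return math.exp(-((t / scale) ** shape))

def interval_failure_probs(t1, t2, scale, shape):
    """Return (unconditional, conditional) probability of failing in (t1, t2]."""
    r1 = weibull_reliability(t1, scale, shape)
    r2 = weibull_reliability(t2, scale, shape)
    unconditional = r1 - r2            # integral of f(t) over (t1, t2]
    conditional = unconditional / r1   # same event, given survival to t1
    return unconditional, conditional

if __name__ == "__main__":
    # Illustrative parameters only (assumed, not fitted to any data set).
    u, c = interval_failure_probs(99.0, 100.0, scale=80.0, shape=8.0)
    print(f"unconditional: {u:.4f}   conditional: {c:.4f}")
```

Because very few units survive to t1 = 99 in this model, the unconditional probability is tiny, while dividing by R(99) makes the conditional probability large.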

Bathtub Curves and Stress Tests
Typically, a population of communication systems will exhibit reliability characterized by a bathtub curve [2, 4, 5], as shown in Figure 1.


Figure 1: Diagram illustrating a typical bathtub curve

The bathtub curve shows three distinct failure regions. Region 1 is the early life or infant mortality region, where the failure rate starts out high near time zero and decreases. The failure rate in this area is attributed to defects in the parts or process. Region 2 is the useful life region, where the failure rate is characterized as being constant, represented by the flat portion of the curve. Region 3 is the wear-out region and is that part of the bathtub curve where the failure rate increases.
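One common way to sketch a curve with these three regions is to add a decreasing hazard, a constant hazard, and an increasing hazard. The example below does this with two Weibull hazard terms plus a constant; the model choice and every parameter value are assumptions for illustration, not something taken from Figure 1.

```python
def weibull_hazard(t, scale, shape):
    """Weibull hazard (shape/scale) * (t/scale)**(shape - 1), valid for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def bathtub_hazard(t, base_rate=0.05):
    """Assumed bathtub model: infant mortality + constant rate + wear-out."""
    infant = weibull_hazard(t, scale=200.0, shape=0.5)   # shape < 1: decreasing
    wearout = weibull_hazard(t, scale=100.0, shape=5.0)  # shape > 1: increasing
    return infant + base_rate + wearout

if __name__ == "__main__":
    for t in (0.1, 1.0, 20.0, 40.0, 100.0, 150.0):
        print(f"t = {t:6.1f}   h(t) = {bathtub_hazard(t):.4f}")
```

With these assumed values the rate is high near t = 0, settles close to the base rate through the middle of the range, and climbs again at large t.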

In addition to using bathtub curves, manufacturers use accelerated stress testing (AST) to ensure that failures due to defects occur before the product leaves the factory. This process of weeding out defects is referred to as burn-in. The units the customer receives after burn-in should therefore lie in the useful life region of the bathtub curve. Note: The consequence of assuming a constant failure rate h(t) is significant, and the constant failure rate assumption is maintained throughout the rest of the paper.

Constant Failure Rate
Now that we've looked at bathtub curves and AST, let's explore the case where we are operating on the flat portion of the bathtub curve, where the failure rate h(t) is constant over time and remains the same for a unit regardless of the unit's age. For this failure rate, we say that the system is memoryless.

Mathematically, we say that a random variable T is memoryless if:

$$P(T > t_1 + t_2 \mid T > t_1) = P(T > t_2) \qquad (6)$$

Equation 6 tells us that, given the unit is alive at t1, the probability that it survives for an additional t2 is the same as the unconditional probability that it survives for t2. In other words, the unit has no memory of t1.

We will now show that the reliability function R(t) for a system with a constant failure rate λ is a negative exponential distribution with parameter λ given by:

$$R(t) = e^{-\lambda t} \qquad (7)$$

From Equation 7, we derive the failure density to be:

$$f(t) = -\frac{dR(t)}{dt} = \lambda e^{-\lambda t} \qquad (8)$$

Note that the failure density f(t) is also exponentially distributed. Since this density is exponentially distributed, we can show the conditional failure density is constant, as shown in Equation 9.

$$h(t) = \frac{f(t)}{R(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda \qquad (9)$$

Therefore a system characterized by a constant failure rate h(t) = λ leads directly to the exponential reliability distributions for R(t) and f(t) given by Equations 7 and 8, respectively.
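For completeness, the forward direction (a constant failure rate implies an exponential R(t)) can be sketched in a few lines. This is a standard derivation written out under the definitions used in this article, h(t) = f(t)/R(t) and f(t) = -dR(t)/dt, with the assumed initial condition R(0) = 1 (the unit is working when put into service).

```latex
\begin{align*}
h(t) = \frac{f(t)}{R(t)} = -\frac{1}{R(t)}\frac{dR(t)}{dt} &= \lambda \\
\frac{d}{dt}\ln R(t) &= -\lambda \\
\ln R(t) &= -\lambda t + \ln R(0) = -\lambda t \qquad \text{(since } R(0) = 1\text{)} \\
R(t) &= e^{-\lambda t}, \qquad f(t) = -\frac{dR(t)}{dt} = \lambda e^{-\lambda t}
\end{align*}
```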

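To illustrate the memoryless property of Equation 7, let's find the conditional probability that a unit survives for t1 + t2 hours given it has already survived for t1 hours:

$$P(T_f > t_1 + t_2 \mid T_f > t_1) = \frac{R(t_1 + t_2)}{R(t_1)} = \frac{e^{-\lambda (t_1 + t_2)}}{e^{-\lambda t_1}} = e^{-\lambda t_2} = R(t_2) \qquad (10)$$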

Equation 10 tells us that if the reliability function is exponential, then the probability that a unit survives to t1 + t2, given that it has survived until t1, is the same as the probability that it survives from 0 to t2. In other words, the system has no memory that it survived from 0 to t1.
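The memoryless property can also be checked numerically. The sketch below compares the conditional survival probability R(t1 + t2)/R(t1) with R(t2) for an exponential reliability function and, for contrast, for a Weibull reliability function whose failure rate grows with age. The Weibull parameters and all function names are assumptions used only for illustration.

```python
import math

def conditional_survival(reliability, t1, t2):
    """P(survive to t1 + t2 | survived to t1) = R(t1 + t2) / R(t1)."""
    return reliability(t1 + t2) / reliability(t1)

def exponential_reliability(t, lam=0.05):
    """Constant failure rate lam: R(t) = exp(-lam * t)."""
    return math.exp(-lam * t)

def weibull_reliability(t, scale=20.0, shape=2.0):
    """Assumed aging model (shape > 1), used here only as a counterexample."""
    return math.exp(-((t / scale) ** shape))

if __name__ == "__main__":
    t1, t2 = 10.0, 5.0
    for name, rel in (("exponential", exponential_reliability),
                      ("weibull", weibull_reliability)):
        cond = conditional_survival(rel, t1, t2)
        print(f"{name:12s} conditional = {cond:.4f}   R(t2) = {rel(t2):.4f}")
```

For the exponential case the two numbers agree for any t1; for the aging model the conditional survival probability is lower than R(t2), because the unit "remembers" its age.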

From these results, we can see that a system operating in the useful-life region of the reliability curve, where the failure rate is constant, has an exponential reliability function and is memoryless. A practical implication of the constant failure rate is that it allows us to solve system reliability equations using differential equations with constant coefficients. In addition, the memoryless property allows us to work alternatively with Markov chains, which provide a means for the calculations to be done using standard matrix techniques. These concepts will be explored in more detail later on. It should be noted that the memoryless assumption has been empirically justified for many types of systems and components [5].

By virtue of the memoryless property, all parts and subsystems are equally likely to fail from any given point on, regardless of how long they have already been in the system. This is a key feature of the memoryless assumption, without which tracking system reliability would be impractical.

The Series-connected System
Consider a system such that any single component failure causes the whole system to fail. This type of system is referred to as a series-connected system (Figure 2).


Figure 2: Diagram of a series-connected system.

Now suppose that n units have been put into service. We wish to find the probability P1(t) of having a single failure in an interval Δt given that the average number of failures per unit time for the ith unit is hi(t). We assume hi(t) = λi, where λi is a constant.

In a small interval Δt, the probability of a failure of the ith unit is λiΔt, provided Δt is small enough that the probability of having two or more simultaneous failures is negligible. In other words, the probability of simultaneous failures approaches 0 as Δt approaches 0. Similarly, the probability of the ith unit having no failure in the interval Δt is 1 − λiΔt. This is referred to as a Poisson process [1, 2].

Now let's demonstrate that failures that occur with probabilities defined by a Poisson process, as described above, lead to the exponential failure distribution. The states for a series-connected system with n elements are defined as:

  • State 0 = System working (no failures)
  • State 1 = System failed (at least one failure)

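The state equation for the series-connected system is given by Equation 11 below. This equation states that the probability of not having failed by time t + Δt is equal to the joint probability of not failing from 0 to t and not failing in the interval from t to t + Δt. Terms containing products of Δt are neglected:

$$P_0(t + \Delta t) = P_0(t)\,\bigl[1 - (\lambda_1 + \lambda_2 + \dots + \lambda_n)\,\Delta t\bigr] \qquad (11)$$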

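Rearranging yields:

$$\frac{P_0(t + \Delta t) - P_0(t)}{\Delta t} = -(\lambda_1 + \lambda_2 + \dots + \lambda_n)\,P_0(t) \qquad (12)$$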

If we define λ = (λ1 + λ2 + ... + λn) and let Δt approach 0, the difference equation becomes the first-order differential equation:

$$\frac{dP_0(t)}{dt} = -\lambda\,P_0(t) \qquad (13)$$

It should be noted that, due to the memoryless assumption, h(t) is a constant, which allows us to work with linear differential equations with constant coefficients. Applying the Laplace transform to Equation 13 yields:

$$s\,P_0(s) - P_0(0) = -\lambda\,P_0(s) \qquad (14)$$

The initial condition for the system assumes there are no failures (the system is in state 0) at time t = 0:

$$P_0(0) = 1 \qquad (15)$$

Substituting the initial condition, the probability of being in state 0 (no failures) is:

$$P_0(s) = \frac{1}{s + \lambda} \qquad (16)$$

This is the reliability function in the complex frequency domain. The reliability function R(t) is found by taking the inverse Laplace transform of Equation 16, which yields:

$$R(t) = P_0(t) = e^{-\lambda t} \qquad (17)$$

The corresponding failure density for the series-connected system is:

$$f(t) = -\frac{dR(t)}{dt} = \lambda e^{-\lambda t} \qquad (18)$$

Therefore, we have shown that if the failure process is Poisson and the conditional failure rate is constant, then the reliability function R(t) and failure density f(t) are both exponential.
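The limiting argument can be checked numerically by stepping the state probability forward in small intervals Δt and comparing the result with the exponential e^(-λt), where λ is the sum of the unit failure rates. This is a minimal sketch; the three failure rates and the function name are assumed values for illustration.

```python
import math

def survive_by_steps(rates, t_end, dt):
    """Step P0 forward: each interval dt, every unit must not fail (prob 1 - lam*dt)."""
    p0 = 1.0                                  # initial condition: no failures at t = 0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        step_survival = 1.0
        for lam in rates:
            step_survival *= 1.0 - lam * dt   # unit with rate lam survives this interval
        p0 *= step_survival
    return p0

if __name__ == "__main__":
    rates = [0.01, 0.02, 0.02]                # assumed per-unit failure rates (per hour)
    t_end = 10.0
    exact = math.exp(-sum(rates) * t_end)     # closed-form exponential reliability
    for dt in (1.0, 0.1, 0.001):
        approx = survive_by_steps(rates, t_end, dt)
        print(f"dt = {dt:<6}  stepped = {approx:.6f}  exponential = {exact:.6f}")
```

As Δt shrinks, the stepped value converges to the exponential, which is the behavior the differential equation describes.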

Let's use an example to illustrate the point further. Assume the amount of time a light bulb works before burning out is exponentially distributed with a mean lifetime of 20 hours. If a person walks into the room and the light bulb is working, what is the probability that the light bulb will last 5 additional hours?

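Solution: Since the light bulb's lifetime distribution is memoryless, the probability that the light bulb will last 5 additional hours is:

$$P(T_f > t + 5 \mid T_f > t) = P(T_f > 5) = e^{-5/20} = e^{-0.25} \approx 0.779 \qquad (19)$$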
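The light bulb calculation can be reproduced in a few lines. Because the lifetime is memoryless, the time the bulb has already been on does not appear in the calculation; only the mean lifetime (20 hours) and the additional time (5 hours) matter. The function name is hypothetical.

```python
import math

def prob_lasts_additional(mean_lifetime_hours, extra_hours):
    """P(lasts extra_hours more) for an exponential lifetime with the given mean."""
    lam = 1.0 / mean_lifetime_hours      # failure rate is the reciprocal of the mean
    return math.exp(-lam * extra_hours)  # R(extra_hours) = exp(-lam * extra_hours)

if __name__ == "__main__":
    p = prob_lasts_additional(20.0, 5.0)
    print(f"P(bulb lasts 5 more hours) = {p:.4f}")  # prints 0.7788
```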

To illustrate the concepts further, plots of f(t), F(t), and R(t) for the case where h(t) = λ = 0.05 are shown in Figures 3 and 4.


Figure 3: f(t) given λ = 0.05.


Figure 4: R(t) and F(t) given λ = 0.05.
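The values behind curves of this kind can be tabulated directly from the exponential forms f(t) = λe^(-λt), R(t) = e^(-λt), and F(t) = 1 - R(t) with λ = 0.05. The time points below are arbitrary choices for illustration and are not read from the figures.

```python
import math

LAMBDA = 0.05  # constant failure rate used in Figures 3 and 4

def failure_density(t, lam=LAMBDA):
    """f(t) = lam * exp(-lam * t)."""
    return lam * math.exp(-lam * t)

def reliability(t, lam=LAMBDA):
    """R(t) = exp(-lam * t)."""
    return math.exp(-lam * t)

def failure_function(t, lam=LAMBDA):
    """F(t) = 1 - R(t)."""
    return 1.0 - reliability(t, lam)

if __name__ == "__main__":
    print(f"{'t':>5} {'f(t)':>8} {'R(t)':>8} {'F(t)':>8}")
    for t in (0, 10, 20, 40, 60, 80, 100):  # arbitrary sample times
        print(f"{t:5d} {failure_density(t):8.4f} "
              f"{reliability(t):8.4f} {failure_function(t):8.4f}")
```

Note that t = 20 is the mean lifetime 1/λ, at which R(t) has fallen to e^(-1), roughly 0.37.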

MTBF
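The best-known metric in reliability theory is mean time between failures (MTBF). MTBF represents the average or mean lifetime of a unit. Finding the expected value of the lifetime under the failure density given by Equation 18 yields the desired MTBF:

$$\mathrm{MTBF} = E[T_f] = \int_0^\infty t\,f(t)\,dt = \int_0^\infty t\,\lambda e^{-\lambda t}\,dt = \frac{1}{\lambda} \qquad (20)$$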

Therefore, for the specific case where the failure distribution is exponential, the MTBF is simply the reciprocal of the failure rate λ.
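The integral behind this result can be worked out with one integration by parts. The steps below are a standard calculation for the exponential density f(t) = λe^(-λt), taking u = t and dv = λe^(-λt) dt.

```latex
\begin{align*}
\mathrm{MTBF} &= \int_0^\infty t\,\lambda e^{-\lambda t}\,dt \\
&= \Bigl[-t\,e^{-\lambda t}\Bigr]_0^\infty + \int_0^\infty e^{-\lambda t}\,dt \\
&= 0 + \Bigl[-\tfrac{1}{\lambda}\,e^{-\lambda t}\Bigr]_0^\infty \\
&= \frac{1}{\lambda}
\end{align*}
```

The boundary term vanishes because the exponential decays faster than t grows.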

Consider now the series-connected system with n elements. Defining λ = (λ1 + λ2 + ... + λn), the total failure rate is the sum of the failure rates of the individual units. The MTBF is then the inverse of the total failure rate, as shown in Equation 21:

$$\mathrm{MTBF} = \frac{1}{\lambda} = \frac{1}{\lambda_1 + \lambda_2 + \dots + \lambda_n} \qquad (21)$$

The composite MTBF for the series connection of n identical units is therefore:

$$\mathrm{MTBF}_{\text{system}} = \frac{\mathrm{MTBF}_{\text{unit}}}{n} \qquad (22)$$
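A short sketch of the series MTBF calculation follows. It converts each unit MTBF to a failure rate, sums the rates, and inverts the total. The function name and the MTBF figures are assumed values for illustration only.

```python
def series_mtbf(unit_mtbfs_hours):
    """MTBF of a series-connected system: reciprocal of the summed failure rates."""
    total_rate = sum(1.0 / mtbf for mtbf in unit_mtbfs_hours)  # lambda_1 + ... + lambda_n
    return 1.0 / total_rate

if __name__ == "__main__":
    # Assumed unit MTBFs in hours.
    mixed = [100_000.0, 200_000.0, 200_000.0]
    print(f"mixed units:     {series_mtbf(mixed):,.0f} hours")

    # n identical units: the system MTBF is the unit MTBF divided by n.
    identical = [100_000.0] * 4
    print(f"4 identical:     {series_mtbf(identical):,.0f} hours")
```

Adding units to a series-connected system can only lower the system MTBF, since every added unit contributes another failure rate to the sum.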

Failures in Time
We see that the MTBFs of series-connected units combine like resistors in parallel. As a result, it is often more convenient to work with the reciprocal of MTBF, expressed as failures in time (FITS). FITS represents the number of failures per billion hours, as shown in Equation 23 (with MTBF in hours).

$$\mathrm{FITS} = \lambda \times 10^9 = \frac{10^9}{\mathrm{MTBF}} \qquad (23)$$

Converting Equation 21 for the series-connected system to FITS yields:

$$\mathrm{FITS}_{\text{system}} = \mathrm{FITS}_1 + \mathrm{FITS}_2 + \dots + \mathrm{FITS}_n \qquad (24)$$
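The convenience of FITS is that series-connected units simply add. The sketch below converts between MTBF (in hours) and FITS and totals a series system; the function names and the unit MTBF values are assumptions for illustration.

```python
HOURS_PER_BILLION = 1e9  # FITS counts failures per 10**9 hours

def mtbf_to_fits(mtbf_hours):
    """FITS = 10**9 / MTBF, with MTBF in hours."""
    return HOURS_PER_BILLION / mtbf_hours

def fits_to_mtbf(fits):
    """Inverse conversion: MTBF in hours = 10**9 / FITS."""
    return HOURS_PER_BILLION / fits

def series_fits(unit_fits):
    """Series-connected system: the unit FITS values add directly."""
    return sum(unit_fits)

if __name__ == "__main__":
    unit_mtbfs = [1_000_000.0, 500_000.0]          # assumed unit MTBFs in hours
    unit_fits = [mtbf_to_fits(m) for m in unit_mtbfs]
    total = series_fits(unit_fits)
    print(f"unit FITS:   {unit_fits}")
    print(f"system FITS: {total:.0f}")
    print(f"system MTBF: {fits_to_mtbf(total):,.0f} hours")
```

Working in FITS turns the reciprocal-of-a-sum-of-reciprocals MTBF calculation into a plain addition, which is easier to maintain in a parts list or spreadsheet.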

On to Part 2
That concludes our look at the fundamental elements needed to make high availability work in a telecom system design. In Part 2, we'll further the discussion by describing a systematic method for analyzing the reliability of redundant systems using matrix techniques.

References

  1. U. N. Bhat, Elements of Stochastic Processes, Wiley, 1984.
  2. Gerald Sandler, System Reliability Engineering, Prentice-Hall.
  3. "Fundamentals of HALT/HASS Testing", white paper, Keithley Instruments Inc.
  4. BetaTherm Reliability Model, BetaTHERM Inc.
  5. Robert Poltz, "Reliability Engineering", ChipCenter.
  6. M. Wiley, "Reliability Testing: Verifying Reliability for Embedded Systems", Paragon Innovations.
  7. R. Feldman, C. Valdez-Flores, Applied Probability and Stochastic Processes, PWS Publishing, 1996.

About the Author
Jeffrey S. Pattavina is the chief system engineer for Harris Corporation's Intraplex access products group. A member of IEEE, Jeff holds a Masters in Electrical Engineering from Northeastern University, Boston, MA. Jeff can be reached at jpattavi@harris.com.




All materials on this site Copyright © 2010 TechInsights, a Division of United Business Media LLC All rights reserved.