Measuring and Designing for Reliability and Maintainability



Insanity is doing the same thing over and over again and expecting different results.

-- Albert Einstein

  • 1 Introduction
  • 2 Terminology
  • 3 Defining and Measuring Reliability and Other Terms
  • 4 Designing and Building for Maintenance and Reliability
  • 5 Summary
  • 6 QUIZ

Learning goals:

• Understand reliability and why it’s important

• Calculate reliability, availability, and maintainability

• Measure and specify reliability

• Review designs for reliability

• Explain the impact of O&M costs on an asset's life cycle cost

1 Introduction

Asset reliability is an important focal point for many organizations.

It's a source of competitive advantage for many visionary companies. It’s the central theme for maintenance departments trying to improve their bottom line. To some, reliability identifies the right work and is synonymous with reliability centered maintenance (RCM). Reliability is not just RCM, however; it has a much broader meaning. Understanding the term reliability and how it differs from maintenance is key to establishing a successful program for improving reliability in any organization. In this section, we will define key terms related to reliability and discuss important factors that will help realize higher reliability of assets and plants.

What and Why Reliability?

Reliability is a broad term that focuses on the ability of an asset to perform its intended function to support manufacturing or to provide a service. Many books written about reliability tend to focus on Reliability Centered Maintenance (RCM). Reliability is not just RCM. RCM is a proactive methodology that uses reliability principles to identify the right work to be done to maintain an asset in a desired condition so that it can keep performing its intended function. In fact, RCM is basically a PM optimization tool for defining the "right" maintenance actions. In its most effective and widely accepted form, it consists of seven structured steps for building a maintenance program for a specific asset. When organizations first try to improve reliability, they often label the undertaking RCM, but RCM really differs from a reliability improvement initiative. Details of the RCM process will be discussed in section 8.

Improving asset reliability is important to the success of any organization, particularly to its operation and maintenance activities. To do this, we need to understand both reliability and maintenance, and how they're interrelated. Reliability is the ability of an asset to perform a required function under a stated set of conditions for a stated period of time, called mission time. Three key elements of asset reliability are the asset function, the conditions under which the asset operates, and mission time. The term reliable assets means that the equipment and plant are available as and when needed, and they will perform their intended function over a predetermined period without failure. Reliability is a design attribute and should be "designed in" when an asset is designed, built, and installed.

On the other hand, maintenance is the act of maintaining, or the work of keeping an asset in proper operating condition. It may consist of performing maintenance inspection and repair to keep assets operating in a safe manner to produce or provide designed capabilities. These actions can be preventive maintenance (PM) and corrective maintenance (CM) actions. So, maintenance keeps assets in an acceptable working condition, prevents them from failing, and, if they fail, brings them back to their operational level effectively and as quickly as possible.

Maintainability is another term we need to understand alongside reliability. It’s another design attribute that goes hand in hand with reliability. It reflects the ease of maintenance. The objective of maintainability is to ensure maintenance tasks can be performed easily, safely, and effectively.

Reliability and maintainability attributes are usually designed into the asset to minimize maintenance needs by using reliable components, simpler replacement, and easier inspections.

With these definitions, the differences start to become clear.

Reliability is designed in and is a strategic task. Maintenance keeps assets functioning and is a tactical task. Maintenance does not improve reliability; it just sustains it. Improving reliability requires redesign or replacement with better, more reliable components. Improving reliability needs new thinking, a new paradigm. Rather than asking how to restore the capability of a failed asset efficiently and effectively, we need to ask what we can do proactively to ensure that the asset does not fail, within the context of meeting the business needs of the overall operation.

A challenge in this transition is the belief that we should strive to maximize asset reliability. However, it has been found that ensuring 100% reliability, although a great goal, often results in high acquisition costs and may require a high level of maintenance to sustain. It may not be a cost-effective strategy and may not be affordable. We need to define an asset's or plant's reliability requirements in the context of supporting the underlying business needs. Then, we inevitably realize that we may need a different level of reliability and an affordable maintenance program.

As shown in FIG. 1, we need to find the right level of reliability required to give us the optimum total cost. This graph illustrates the production or use cost (operations and downtime cost) versus the reliability (and maintenance) cost.

FIG. 1 Reliability/Availability Economics (cost plotted against reliability/availability; curves: reliability cost, production and use costs, and total cost, with the optimum level marked)

Why is Reliability Important?

Asset reliability is an important attribute for several reasons, including:

• Customer Satisfaction. Reliable assets will perform to meet the customer's needs on time and every time. An unreliable asset will severely reduce customer satisfaction. Thus, high reliability is a mandatory requirement for customer satisfaction.

• Reputation. An organization's reputation is very closely related to the reliability of its services. The more reliable the plant assets are, the more likely the organization is to have a favorable reputation.

• O & M Costs. Poor asset performance will cost more to operate and maintain.

• Repeat Business. Reliable assets and plants will ensure that customers' needs are being met in a timely manner. Customer satisfaction will bring repeat business and also have a positive impact on future business.

• Competitive Advantage. Many leading and visionary companies have begun achieving high reliability / availability of their plants and assets. As a result of their greater emphasis on plant reliability improvement programs, they gain an advantage over their competition.

Reliability vs. Quality Control

In a manufacturing process, quality control (QC) is concerned with how the process is meeting specifications to guarantee consistent product quality. Its objective is to see that both an asset and its components are manufactured and assembled with high quality standards and meet the designed specifications. Thus, QC is a snapshot of the manufacturing process' quality program at a specific time. Reliability is usually concerned with failures after an asset has been put in operation for its whole life. The QC of manufacturing processes for building assets makes an essential contribution to the reliability of an asset - it can be considered as an integral part of an overall reliability program.

The same way that a chain is only as strong as its weakest link, an asset is only as good as the inherent reliability of the asset, and the quality of the manufacturing process used to build or assemble this asset. Even though an asset may have a reliable design, its reliability may still be unsatisfactory when the asset is built and installed or used in the field. The reason for this low reliability may be that the asset or its components were poorly built. This could be the result of a substandard manufacturing process to build the asset. For example, cold solder joints could pass initial testing at the manufacturer, but fail in the field as the result of thermal cycling or vibration. This type of failure does not occur due to poor design, but as a result of an inferior manufacturing process.

Usually assets are designed with a level of reliability based on the effective use of reliable components and their configurations. Some components may be working in series and others in parallel arrangements to provide the desired overall reliability. This level of reliability is called inherent reliability. After the asset has been installed, the reliability of an asset cannot be changed without redesigning or replacing it with better and improved components. However, asset availability can be improved by repairing or replacing bad components before they fail, and by implementing a good reliability-based PM plan.

Evaluating and finding ways to attain high asset reliability are key aspects of reliability engineering. There are a number of practices we can apply to improve the reliability of assets. We will be discussing these practices to improve reliability later in this section, as well as in other sections.

2 Terminology

Availability (A)

The probability that an asset is capable of performing its intended function satisfactorily, when needed, in a stated environment.

Availability is a function of reliability and maintainability.

Failure

Failure is the inability of an asset / component to meet its expected performance. It does not require the asset to be inoperable. The failure could also mean reduced speed, or not meeting operational or quality requirements.

Failure Rate

The number of failures of an asset over a period of time. Failure rate is considered constant over the useful life of an asset. It’s normally expressed as the number of failures per unit time.

Denoted by lambda (λ), failure rate is the inverse of Mean Time Between Failures (MTBF).

Maintainability (M)

The ease and speed with which a maintenance activity can be carried out on an asset. Maintainability is a function of equipment design and usually is measured by MTTR.

Mean Time Between Failures (MTBF)

MTBF is a basic measure of asset reliability. It’s calculated by dividing total operating time of the asset by the number of failures over some period of time. MTBF is the inverse of failure rate (λ).

Mean Time to Repair (MTTR)

MTTR is the average time needed to restore an asset to its full operational condition upon a failure. It’s calculated by dividing total repair time of the asset by the number of failures over some period of time. It’s a basic measure of maintainability.

Reliability (R)

The probability that an asset or item will perform its intended functions for a specific period of time under stated conditions. It’s usually expressed as a percentage and measured by the mean time between failures (MTBF).

Reliability Centered Maintenance (RCM)

A systematic and structured process to develop an efficient and effective maintenance plan for an asset to minimize the probability of failures. The process ensures safety and mission compliance.

Uptime

Uptime is the time during which an asset or system is either fully operational or is ready to perform its intended function. It’s the opposite of downtime.

3 Defining and Measuring Reliability and Other Terms

There are two types of assets: repairable and non-repairable.

Assets or components that can be repaired when they fail are called repairable, e.g., compressors, hydraulic systems, pumps, motors, and valves. Reliability of these repairable systems is characterized by the term MTBF (Mean Time Between Failure).

Assets or components that can't be repaired when they fail are called non-repairable, e.g., bulbs, rocket motors, and circuit boards. Some components, such as integrated circuit boards, could be repaired, but the repair work would cost more than a new replacement component. Therefore, they're considered non-repairable. Reliability of non-repairable systems is characterized by the term MTTF (Mean Time to Failure).

Reliability, Maintainability and Availability

Reliability (R), as defined in military standard (MIL-STD-721C), is "the probability that an item will perform its intended function for a specific interval under stated conditions."

As defined here, an item or asset could be an electronic or mechanical hardware product, software, or a manufacturing process. The reliability is usually measured by MTBF and calculated by dividing operating time by the number of failures. Suppose an asset was in operation for 2000 hours (or for 12 months) and during this period there were 10 failures. The MTBF for this asset is:

MTBF = 2000 hours / 10 failures = 200 hours per failure

or

MTBF = 12 months / 10 failures = 1.2 months per failure

A larger MTBF generally indicates a more reliable asset or component.

Maintainability (M) is the measure of an item's or asset's ability to be retained in or restored to a specified condition when maintenance is performed by personnel having specified skill levels, using prescribed procedures and resources at each stage of maintenance and repair.

Maintainability is usually expressed in hours by Mean Time to Repair (MTTR), or sometimes by Mean Downtime (MDT). MTTR is the average time to repair assets. It’s pure repair time (which some call wrench time).

In contrast, MDT is the total time the asset is down, which includes repair time plus additional waiting delays.

In simple terms, maintainability usually refers to those features of assets, components, or total systems that contribute to the ease of maintenance and repair. A lower MTTR generally indicates easier maintenance and repair.

FIG. 2a Trending of MTBF Data.

Figure 2 a, b, and c show trends of MTBF and MTTR data in hours.

The baseline should be based on at least one year of data, dependent on your operations (could require as much as three years of data for assets with minimal operating time). This type of trend line is essential for tracking impact of improvements. FIG. 2a shows MTBF trend data, which is increasing. This trend is a good one.

FIG. 2b shows MTTR trend data, which is increasing. It’s going in the wrong direction. We need to evaluate why MTTR is increasing by asking: Do we have the right set of skills in our work force? Do we identify and provide the right materials, tools, and work instructions? What can we do to reverse the trend?

FIG. 2b Trending of MTTR (I) Data

FIG. 2c Trending of MTTR (D) Data

FIG. 2c shows MTTR trend data, which is decreasing. In this case, the trending is in the right direction. To continue this trend, we need to ask the questions: What caused this to happen? What changes did we make? Trending of this type of data can help to improve the decision process.
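The trending described above can be sketched in a few lines. This is only an illustration: the quarterly figures are hypothetical (they are not the data behind FIG. 2), and the helper names `mtbf` and `trend_slope` are assumptions made for this sketch. It computes MTBF for each period and fits a least-squares slope to show the direction of the trend.

```python
# Hypothetical quarterly data: (operating hours, number of failures).
# These values are illustrative only and do not come from FIG. 2.
quarters = [(500, 5), (520, 4), (510, 3), (530, 2)]


def mtbf(operating_hours, failures):
    """MTBF = operating time / number of failures."""
    return operating_hours / failures


def trend_slope(values):
    """Least-squares slope of the values against their index 0, 1, 2, ..."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


mtbf_series = [mtbf(hours, failures) for hours, failures in quarters]
slope = trend_slope(mtbf_series)

# For MTBF a positive slope is the good direction.
direction = "improving" if slope > 0 else "flat or worsening"
print(mtbf_series, slope, direction)
```

The same slope calculation applies to MTTR data, but the interpretation flips: a positive MTTR slope is the wrong direction.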

Availability

Availability (A) is a function of reliability and maintainability of the asset. It’s measured by the degree to which an item or asset is in an operable and committable state at the start of the mission when the mission is called at an unspecified (random) time.

In simple terms, the availability may be stated as the probability that an asset will be in operating condition when needed. Mathematically, the availability is defined:

Availability (A) = MTBF / (MTBF + MTTR) = Uptime / (Uptime + Downtime)

The availability defined above is usually referred to as inherent availability (Ai). It’s the designer's best possible option.

In reality, actual availability will be lower than inherent availability as the asset will be down due to preventive and corrective maintenance actions. Another term, Operational Availability (Ao), considers both preventive and corrective maintenance and includes all delays - administrative, materials and tools, travel, information gathering, etc. - that keep the asset unavailable. Achieved Availability (Aa) includes preventive maintenance, but not delays for getting materials and tools, information, etc.

Naturally, the designer or the manufacturer of the asset should be responsible for inherent or achieved availability. The user of the asset should be interested in operational availability. The inherent availability will be degraded as we use the asset and it can never be improved upon without changes to the hardware and software. Availability can be improved by increasing reliability and maintainability. Trade-off studies should be performed to evaluate cost effectiveness of increasing MTBF (reliability) or decreasing MTTR (maintainability). For the sake of simplicity and to reduce confusion, we will be using the term Availability in this guide to represent inherent availability.
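A trade-off study of this kind can start from a simple comparison. The sketch below assumes the inherent availability relationship A = MTBF / (MTBF + MTTR); the baseline values of 200 hours and 10 hours are hypothetical, and the function name is an assumption made for illustration.

```python
def inherent_availability(mtbf_hours, mttr_hours):
    """Inherent availability: Ai = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline and two improvement options.
baseline = inherent_availability(200, 10)     # as designed
double_mtbf = inherent_availability(400, 10)  # reliability improvement
halve_mttr = inherent_availability(200, 5)    # maintainability improvement

print(f"baseline    {baseline:.4f}")
print(f"double MTBF {double_mtbf:.4f}")
print(f"halve MTTR  {halve_mttr:.4f}")
```

Because availability depends only on the ratio of MTTR to MTBF, doubling MTBF and halving MTTR give the same result here. The choice between them then comes down to which change costs less to achieve.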

The standard for availability is about 95%, meaning that the asset is available for 9.5 hours out of 10. This is based on general industry expectations. In some cases, if assets are not very critical, the standard may be lower. But in case of critical assets such as aero-engines or assets involved with 24-7 operations, the standard may require 99% or higher availability.

In general, the cost to achieve availability above 95% increases exponentially. Therefore, we need to perform operational analysis to justify high availability requirements, particularly if it's over 97 percent.

The Bathtub Curve and Reliability Distribution

The bathtub curve seen in FIG. 3 is widely used in reliability engineering, although the general concept is applicable to people as well.

The curve describes a particular form of the hazard function which comprises three parts:

• The first part is a decreasing failure rate, known as early failures or infant mortality. It's similar to our childhood.

• The second part is a constant failure rate, known as random failures. It's similar to our adult life.

• The third part is an increasing failure rate, known as wear-out failures. It's similar to our old age.

The bathtub curve is generated by mapping three rates over an asset's life: the rate of early (infant mortality) failures when the asset is first introduced, the low, constant rate of random failures during its useful life, and finally the rate of wear-out failures as the asset approaches its design lifetime.

In less technical terms, in the early life of an asset adhering to the bathtub curve, the failure rate is high. However, it quickly decreases as defective components are identified and discarded, and early sources of potential failure, including installation errors, are eliminated. In the mid-life of an asset, the failure rate is generally low and constant. In the late life of the asset, the failure rate increases, as age and wear take their toll.

The airline industry and the U.S. Navy performed studies in the 1960s and 1970s to better understand asset failures. These studies showed that the failure behavior of these types of assets stays fairly close to the bathtub concept, even though the assets don’t all follow the bathtub curve exactly. All assets followed a constant or slightly increasing failure rate for most of their life. Some didn't show an early mortality region and some didn't have a wear-out region either. FIG. 4 shows a series of these failure patterns based on original study data. The failure patterns are categorized into two groups: age-related and random. Fewer than 20% of failures follow an age degradation pattern; the remainder follow a random pattern with constant failure rate.

FIG. 3 The Bathtub Curve (hypothetical failure rate versus time: infant mortality with decreasing failure rate; normal or useful life with low, "constant" failure rate; end-of-life wear-out with increasing failure rate)

FIG. 4 Failure Patterns

Reliability Failure Distribution

The exponential distribution is one of the most common distributions used to describe the reliability of an asset or a component in a system. It models an asset or component with a constant failure rate, that is, the flat section of the bathtub curve. Most assets, consumer or industrial, follow the constant failure rate for their useful life, so the exponential distribution is widely used to estimate reliability. The basic equation for estimating reliability, R(t), is:

R(t) = e^(-λt)

where

λ (lambda) = failure rate = 1/MTBF

t = mission time, in cycles, hours, miles, etc.

(Note: e is the base of the natural logarithm, approximately 2.71828.)

Calculating Reliability and Availability

Example 1

A hydraulic system, which supports a machining center, has operated 3600 hours in the last two years. The plant's CMMS system indicated that there were 12 failures during this period. What is the reliability of this hydraulic system if it’s required to operate for 20 hours or for 100 hours?

MTBF = operating time / # of failures = 3600 / 12 = 300 hours

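Failure rate = 1 / MTBF = 1 / 300 = 0.003333 failures / hour

Reliability for 20 hours of operation,

R(20) = e^(-λt) = e^(-0.003333 x 20) = e^(-0.0667) = 0.9355 = 93.55%

Reliability for 100 hours of operation,

R(100) = e^(-λt) = e^(-0.003333 x 100) = e^(-0.3333) = 0.7165 = 71.65%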

For 100 hours of operation, the reliability of the hydraulic system is 71.65%. This means that there is a 71.65% probability that the hydraulic system will operate without a failure. If we need to operate the system for only 20 hours, however, the probability of failure-free operations will increase to 93.55%.
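The figures in Example 1 can be reproduced with a short script. This is a minimal sketch that assumes the constant failure rate (exponential) model used in this section; the function name `reliability` is an assumption made for illustration.

```python
import math


def reliability(mission_hours, mtbf_hours):
    """R(t) = exp(-t / MTBF), valid only for a constant failure rate."""
    return math.exp(-mission_hours / mtbf_hours)


# Example 1: 3600 operating hours and 12 failures.
mtbf_hours = 3600 / 12

r20 = reliability(20, mtbf_hours)
r100 = reliability(100, mtbf_hours)
print(f"MTBF = {mtbf_hours:.0f} h, R(20) = {r20:.2%}, R(100) = {r100:.2%}")
```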

Now, let us suppose that there is a need to operate this hydraulic system for 100 hours to meet a key customer's needs and the current reliability of 71.65% is not acceptable. The system needs to have 95% or better assurance (probability) to meet the customer's need.

To have reliability requirements of 95% for 100 hours of mission time, we need to calculate a new failure rate, λ. We use the reliability equation,

Required Reliability = 0.95 = R(100) = e^(-λ x 100)

Solving this equation gives us λ x 100 = -ln(0.95) ≈ 0.0513, which is rounded here to 0.05:

100 λ = 0.05

Thus, failure rate λ = 0.0005 or MTBF = 2000 hours (the unrounded value gives an MTBF of about 1950 hours).

This indicates that the failure rate needs to be dropped from 0.00333 (or an MTBF of 300 hours) to a new failure rate of 0.0005 (or an MTBF of 2000 hours). If we consider the same 3600 operating hours, then the number of failures needs to be reduced from 12 to 1.8. A root cause failure or FMEA analysis needs to be performed on this hydraulic system to identify unreliable components. Some components may need to be redesigned or replaced to achieve the new MTBF of 2000 hours.
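The required MTBF can also be found by solving the reliability equation directly rather than rounding. The sketch below assumes the exponential model, R = e^(-t/MTBF); the function name `required_mtbf` is an assumption made for illustration. It gives a value slightly below the rounded 2000 hours used in the text.

```python
import math


def required_mtbf(target_reliability, mission_hours):
    """Solve R = exp(-t / MTBF) for MTBF: MTBF = -t / ln(R)."""
    return -mission_hours / math.log(target_reliability)


# 95% reliability over a 100-hour mission.
mtbf_needed = required_mtbf(0.95, 100)
failure_rate = 1 / mtbf_needed

# Failures expected over the same 3600 operating hours at the new MTBF.
allowed_failures = 3600 / mtbf_needed

print(f"MTBF needed      = {mtbf_needed:.0f} hours")
print(f"failure rate     = {failure_rate:.6f} per hour")
print(f"allowed failures = {allowed_failures:.2f} per 3600 hours")
```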

Example 2

A plant's air compressor system operated for 1000 hours last year. The plant's CMMS system provided the following data on this system:

Operating time = 1000 hours

Number of failures, random = 10

Total hours of repair time = 50 hours

FIG. 5 Compressor Failure Data

FIG. 6 Compressor Failure and Repair Time Data

What's the availability and reliability of this compressor system if we have to operate this unit for 10, 20, or 100 hours? FIG. 5 shows the failure data and FIG. 6 shows repair time data for those failures.

FIG. 5 shows that the first failure happened at 100 hours of operation, the second at 152 hours of operation, and so forth. FIG. 6 shows that the first failure happened at 100 hours of operation and took 2 hours to repair; the second failure happened at 152 hours of operation and took 6 hours to repair; and so forth. The total repair time for 10 failures is 50 hours.

Calculating MTBF and Failure Rate

MTBF = Operating Time / # of Failures = 1000 hours / 10 failures = 100 hours

This indicates that the average time between failures is 100 hours.

Failure Rate (λ, lambda) = 1 / MTBF = 1 / 100 = 0.01 failures / hour

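Calculating MTTR and Repair Rate

MTTR = Total Repair Time / # of Failures = 50 hours / 10 failures = 5 hours

Repair Rate = 1 / MTTR = 1 / 5 = 0.2 repairs / hour

Calculating Availability

Earlier, we calculated Mean Time Between Failures (MTBF) = 100 hours and Mean Time To Repair (MTTR) = 5 hours. Then,

Availability (A) = MTBF / (MTBF + MTTR) = 100 / (100 + 5) = 0.952, or about 95%

or

Availability (A) = Uptime / (Uptime + Downtime) = 1000 / (1000 + 50) = 0.952, or about 95%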

This means that the asset is available 95% of the time and is down for 5% of the time for repair.

Calculating Reliability

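As calculated earlier for the compressor unit, MTBF = 100 hours

Failure Rate λ (FR) = 1 / MTBF = 1 / 100 = 0.01 failures / hour

Reliability R(t) = e^(-λt)

So, for this compressor system with MTBF of 100 hours,

Reliability for 10 hours of operation = e^(-0.01 x 10) = e^(-0.1) = 0.905, or about 90%

Reliability for 20 hours of operation = e^(-0.01 x 20) = e^(-0.2) = 0.819, or about 82%

Reliability for 100 hours of operation = e^(-0.01 x 100) = e^(-1) = 0.368, or about 37%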

This data indicates that the reliability of the air compressor unit in this example is 90% for 10 hours of operation. However, reliability drops to 37% if we decide to operate the unit for 100 hours. For 20 hours of operation, reliability is 82%. If this level of reliability is not acceptable, then we need to perform root cause failure or FMEA analysis to determine what component needs to be redesigned or changed to reduce the number of failures, thereby increasing reliability.
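Example 2 can be worked end to end from the three CMMS totals. The sketch below is one possible way to organize the calculation, assuming a constant failure rate; the class name `AssetHistory` and its fields are hypothetical and not part of any CMMS product.

```python
import math
from dataclasses import dataclass


@dataclass
class AssetHistory:
    """Totals pulled from maintenance records (hypothetical structure)."""
    operating_hours: float
    failures: int
    repair_hours: float

    @property
    def mtbf(self):
        return self.operating_hours / self.failures

    @property
    def mttr(self):
        return self.repair_hours / self.failures

    @property
    def availability(self):
        # Inherent availability: MTBF / (MTBF + MTTR).
        return self.mtbf / (self.mtbf + self.mttr)

    def reliability(self, mission_hours):
        # Exponential model: R(t) = exp(-t / MTBF).
        return math.exp(-mission_hours / self.mtbf)


compressor = AssetHistory(operating_hours=1000, failures=10, repair_hours=50)

print(f"MTBF = {compressor.mtbf:.0f} h, MTTR = {compressor.mttr:.0f} h")
print(f"Availability = {compressor.availability:.1%}")
for hours in (10, 20, 100):
    print(f"R({hours}) = {compressor.reliability(hours):.0%}")
```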

Reliability Block Diagram (RBD)

The failure logic of an asset, components, or a group of assets and components called a system can be shown as a reliability block diagram (RBD). This diagram shows logical connections among the system's components and assets. The RBD is not necessarily the same as a schematic diagram of the system's functional layout. The system is usually made of several components and assets which may be in series, parallel, or combination configurations to provide us the designed (inherent) reliability. The RBD analysis consists of reducing the system to simple series and parallel component and asset blocks which can be analyzed using mathematical formulas.

FIG. 7 shows a simple diagram, using two independent components and assets to form a system in series.

FIG. 7 An Example of Series System

The reliability of a system with multiple components in series is calculated by multiplying individual component reliabilities,

Rsys = R1 x R2 x R3 x ... x Rn

And the reliability of system Rsys12, as shown in FIG. 7, is

Rsys12 = R1 x R2 = e^(-λ1 t) x e^(-λ2 t) = e^(-(λ1 + λ2) t)

where λ is failure rate and t is the mission time.
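For independent components in series, multiplying the component reliabilities is equivalent to adding their failure rates under the exponential model. The sketch below illustrates both routes; the two failure rates and the 100-hour mission time are hypothetical values chosen only for illustration, and the function names are assumptions.

```python
import math


def series_reliability(component_reliabilities):
    """Series system: product of the independent component reliabilities."""
    return math.prod(component_reliabilities)


def series_reliability_from_rates(failure_rates, mission_hours):
    """Same result under the exponential model: the failure rates add."""
    return math.exp(-sum(failure_rates) * mission_hours)


# Hypothetical failure rates (per hour) for two components in series.
rates = [0.001, 0.002]
t = 100

by_product = series_reliability([math.exp(-rate * t) for rate in rates])
by_rate_sum = series_reliability_from_rates(rates, t)
print(f"{by_product:.4f} {by_rate_sum:.4f}")
```

Because every factor is at most 1, the series result can never exceed the reliability of the weakest component.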

FIG. 8 An Example of a Parallel System

Active Redundancy or Parallel System

The RBD for the simplest redundant system is shown in FIG. 8. This system is composed of two independent components and assets with reliability of R3 and R4.

The reliability of a parallel system as shown is often written as

Rsys34 = 1 - (1 - R3) x (1 - R4) = R3 + R4 - (R3 x R4)

In this arrangement, the reliability of the system, Rsys34 is equal to the probability of component 3 or 4 surviving. It simply means that one of the components is needed to operate the system and the other component is in active state and available if the first one fails. Therefore, the reliability of the whole system in parallel configuration is much higher than in series configuration. The components in parallel improve system reliability whereas components in series lower system reliability.

Standby redundancy is achieved when, in a redundant system, the spare component is not in an active mode continuously, but gets switched on only when the primary component fails. In standby mode, the resultant reliability is a little higher in comparison to active mode. However, the assumption is made that switching is done without failure or without any delay. The reliability of a two-component system in standby mode, with both components having the same failure rate λ, is:

Rsys-standby = e^(-λt) x (1 + λt)

Example 3

In a two-component parallel system with a failure rate of 0.1 /hour of each component, what would be the active and standby reliability of the system for one hour of operation?
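One way to work Example 3 is sketched below. It assumes the usual formulas for two independent, identical components: 1 - (1 - R)^2 for active redundancy, and e^(-λt) x (1 + λt) for standby redundancy with perfect, instantaneous switching. The function names are assumptions made for illustration.

```python
import math


def active_parallel(r_a, r_b):
    """Active redundancy: the system fails only if both components fail."""
    return 1 - (1 - r_a) * (1 - r_b)


def standby_two_identical(failure_rate, mission_hours):
    """Standby redundancy, two identical units, perfect switching assumed."""
    lt = failure_rate * mission_hours
    return math.exp(-lt) * (1 + lt)


failure_rate = 0.1  # failures per hour, for each component
mission_hours = 1

r_component = math.exp(-failure_rate * mission_hours)
r_active = active_parallel(r_component, r_component)
r_standby = standby_two_identical(failure_rate, mission_hours)

print(f"single component = {r_component:.4f}")
print(f"active parallel  = {r_active:.4f}")
print(f"standby          = {r_standby:.4f}")
```

Under these assumptions, standby comes out slightly higher than active redundancy, and both are well above a single component.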

In real applications, there will be many components and assets in series and parallel arrangements, depending upon design requirements. For example, FIG. 9 shows a typical system made up of 13 components, or individual assets, arranged in a combination of series and parallel configurations. The system reliability can be determined by calculating first the individual component and asset reliability, then the system's subsystems, and finally the system as a whole. Reliability of some of the subsystems that are in parallel arrangements can be calculated using m-out-of-n reliability formulas. Here, m is the number of legs (components in series) out of n that must work for the system to operate properly. In FIG. 9, subsystem B has three legs, but we need only one to operate the system. Similarly, subsystem C has three components and assets in parallel, and we need two to operate.

A simple approach for calculating the reliability of m-out-of-n systems is utilizing the binomial distribution and the relationship (R + Q)^n = 1, where R is reliability, Q is unreliability, and n is the number of elements. FIG. 10 shows the formula for 2-, 3-, and 4-element systems to calculate system reliability.

FIG. 9 An Example of Multiple Component System RBD

FIG. 10 System Reliability for m-out-of-n components
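The binomial approach can be sketched as a sum over the terms of (R + Q)^n in which at least m elements survive. The sketch assumes identical, independent legs; the leg reliability of 0.9 is a hypothetical value, since the component data for FIG. 9 is not reproduced here, and the function name is an assumption.

```python
from math import comb


def m_out_of_n(m, n, r):
    """Probability that at least m of n identical, independent legs survive.

    Sums the terms of (R + Q)^n that have m or more surviving legs.
    """
    q = 1 - r
    return sum(comb(n, k) * r**k * q ** (n - k) for k in range(m, n + 1))


leg_reliability = 0.9  # hypothetical value for illustration

one_of_three = m_out_of_n(1, 3, leg_reliability)  # like subsystem B
two_of_three = m_out_of_n(2, 3, leg_reliability)  # like subsystem C

print(f"1-out-of-3: {one_of_three:.3f}")
print(f"2-out-of-3: {two_of_three:.3f}")
```

Two special cases are worth noting: 1-out-of-n is the plain parallel arrangement, and n-out-of-n reduces to the series product.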

Therefore,

MTBF of total system = 1/λ = 1/0.001443 = 693 hours

Based on 16 hours of operation, the reliability of each component is calculated and shown in the reliability block diagram in FIG. 12. The system reliability

Rsys = R1 x R2 x R3 x R4 x R5 x R6 x R7

FIG. 11 Compressor X1 System with Major Components and Failure Rates

FIG. 12 Reliability Block Diagram for Compressor X1 System

Substituting individual reliability, the system reliability

Rsys = 0.9978 x 1 x 1 x 0.9989 x 0.9934 x 0.9891 x 0.9978 = 0.9772
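The product above can be checked in code and cross-checked against the system failure rate of 0.001443 per hour quoted earlier. This is a minimal sketch using only the seven component reliabilities and the 16-hour mission time given in the text; the variable names are assumptions made for illustration.

```python
import math

# Component reliabilities R1..R7 for 16 hours of operation, from the text.
component_reliabilities = [0.9978, 1.0, 1.0, 0.9989, 0.9934, 0.9891, 0.9978]

# Series system: multiply the component reliabilities.
r_sys = math.prod(component_reliabilities)

# Cross-check with the total system failure rate and the exponential model.
system_failure_rate = 0.001443  # failures per hour
mission_hours = 16
r_from_rate = math.exp(-system_failure_rate * mission_hours)

print(f"product of components = {r_sys:.4f}")
print(f"from failure rate     = {r_from_rate:.4f}")
print(f"system MTBF           = {1 / system_failure_rate:.0f} hours")
```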

Thus the reliability of the compressor unit is 97.72% based on 16 hours of operation. Now, let us assume that there are two compressor systems X1 and X2 in the facility, as shown in FIG. 13, with reliability of

X1 = 0.9772 X2 = 0.8545

FIG. 13 Two-Compressor System Arrangement

These reliability levels are based on 16 hours, 2-shift operation scenarios. Let us also assume that most of the time, say 85%, we need only one compressor to meet our production needs. The second compressor will work as active standby. However, for 15 percent of the time, we may need both of the compressor units. During that time, both compressors will be in series arrangement.

The reliability of one unit (85% of time needing only one compressor)

FIG. 14 Example of a Reliability Block Diagram at Plant Systems

The reliability of two units (15% of time needing both compressors)

So, when there is need for only one compressor unit, we are 98% reliable. However, when there is need for both compressors, we are only 85% reliable to meet the customer's needs. This level may be acceptable. If not, we may need to redesign or replace some of the components in compressor X2 to make it more reliable.

Similarly, a reliability block diagram for a process, a manufacturing line, or a plant could be developed, as shown in FIG. 14. This type of reliability block diagram can provide the information needed to improve the reliability of systems in the plant.

4 Designing and Building for Maintenance and Reliability

Asset Life Cycle Cost

Life cycle costs (LCC) are all costs expected during the life of an asset. The term refers to all costs associated with acquisition and ownership, specifically operations and maintenance, of the asset over its full life, including disposal. FIG. 15 shows a typical asset life cycle chart.

The total cost during the life of an asset includes:

• Acquisition Cost

  • Design and Development

  • Demonstration and Validation (mostly applicable to one-of-a-kind, unique systems)

  • Build and Installation (including commissioning)

• Operations and Maintenance (O&M)

  • Operating Cost (including energy and supplies)

  • Maintenance Cost

    • PM

    • CM

• Disposal

Based on several studies reported, the distribution of estimated LCC is as follows:

Life cycle phase | For a Typical DoD System* | Industrial

Design and Development | 10 - 20% | 5 - 10%

Production / Fabrication / Installation | 20 - 30% | 10 - 20%

Operations and Maintenance (O&M) | 50 - 70% | 65 - 85%

Disposal | < 5% | < 5%

Assurance Technologies Principles and Practices:

A graph showing the typical cost commitment and expenditures for the life of an asset is shown in FIG. 16, as reported by Paul Barringer, a leading reliability expert. It’s clear from the figure that the O&M cost is on average about 80% of the total life cycle cost of the asset, so minimizing operations and maintenance (O&M) costs is important. As shown in the chart, the major portion of the O&M cost becomes fixed during the early design and development phase of the asset.

There are ample opportunities to reduce the LCC during the design, building, and installation of the asset.
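As a rough illustration of how the cost elements listed above roll up into an LCC figure, the sketch below sums a set of cost categories and reports the O&M share. All dollar amounts are hypothetical placeholders chosen only for illustration; they are not taken from the studies cited above.

```python
# Hypothetical cost figures (in $ thousands) for illustration only.
# The category names follow the LCC breakdown listed in the text.
life_cycle_costs = {
    "design_and_development": 100,
    "build_and_installation": 150,
    "operations": 500,
    "maintenance": 200,
    "disposal": 50,
}

# Total life cycle cost is the sum of all categories.
total_lcc = sum(life_cycle_costs.values())

# O&M cost combines operating cost and maintenance cost (PM + CM).
om_cost = life_cycle_costs["operations"] + life_cycle_costs["maintenance"]
om_share = om_cost / total_lcc

print(f"Total LCC: ${total_lcc}k")
print(f"O&M share of LCC: {om_share:.0%}")
```

With these assumed numbers, O&M accounts for 70% of the total, which falls inside the 65 - 85% range quoted for industrial systems.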

FIG. 15 Asset Life Cycle

FIG. 16 Cost Commitment and Expenditures during an Asset Life Cycle

Assets should be designed so that they can be operated and maintained easily with minimum operations and maintenance needs. As discussed earlier in this section, reliability and maintainability are design attributes; they should be designed in, rather than added later.

To have reliable and easy-to-maintain assets, we need to ensure that asset owners, including operators, are involved in developing the requirements as well as in reviewing the final design. In designing for reliability and maintainability, attention must focus on:

• Reliability requirements and specifications

• Designing for reliability and maintainability

• Proper component selection and configuration to guarantee required reliability and availability

• Review design for maintainability

• Logistics support - maintenance plan and documentation to reduce MTTR

• Reducing the operations and maintenance costs

Reliability Requirements and Specifications

In order to develop a reliable asset, there must be good reliability requirements and specifications. These specifications should address most, if not all, of the conditions in which the asset has to operate, including mission time, usage limitations, and operating environment. In many instances, developing these specifications will require a detailed description of how the asset is expected to perform from a reliability perspective. Use of a single metric, such as MTBF, as the sole reliability metric is inadequate.

Even worse is the specification that an asset will be "no worse" than the existing or earlier model. An ambiguous reliability specification leaves a great deal of room for error, resulting in poorly-understood design requirements and an unreliable asset in the field.

Of course, there may be situations in which an organization lacks the reliability background or history to properly define specifications for asset reliability. In these instances, an analysis of existing data from previous or similar assets may be necessary. If enough information exists to characterize the reliability performance of a similar asset, it should be a relatively simple matter to transform this historical reliability data into specifications for the desired reliability performance of the new asset.

Financial concerns will also have to be taken into account when formulating reliability specifications. What reliability can we afford? How many failures can we live with? Do we need to have zero failures? Zero failures is a great goal, but can we justify the cost of achieving it? A proper balance of financial constraints and realistic asset reliability performance expectations is necessary to develop a detailed and balanced reliability specification.

Key Elements of Reliability Specifications

• Probability of successful performance

• Function (mission) to be performed

• Usage time (mission time)

• Operating conditions

• Environment

• Skill of operators / maintainers

An example of reliability requirements for an automotive system consisting of an engine, a starter motor, and a battery:

There shall be a 90% probability (of success) that the cranking speed is more than 85 rpm after 10 seconds of cranking (mission) at -20°F (environment) for a period of 10 years or 100,000 miles (time). The reliability shall be demonstrated at 95% confidence.
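To give a feel for what "demonstrated at 95% confidence" can imply, the sketch below estimates how many units would have to pass a test with zero failures. It assumes a simple pass/fail (binomial) success-run test, where n must satisfy R^n <= 1 - C. This is one common approach and is an assumption here; the text does not say which demonstration method is intended. The function name is hypothetical.

```python
import math


def zero_failure_sample_size(reliability: float, confidence: float) -> int:
    """Smallest number of units that must all pass (no failures) to
    demonstrate `reliability` at `confidence`, using the success-run
    relation reliability**n <= 1 - confidence."""
    if not (0 < reliability < 1 and 0 < confidence < 1):
        raise ValueError("reliability and confidence must be between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(reliability))


# Requirement from the example: 90% reliability at 95% confidence.
n = zero_failure_sample_size(0.90, 0.95)
print(f"Units to test with zero failures: {n}")
```

Under these assumptions, 29 systems would each have to complete the cranking mission successfully. A higher required reliability or confidence increases the sample size, which is one reason the cost of demonstration belongs in the specification discussion.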

Let us take another example of a manufacturing cell / system that needs to produce xyz product, at a rate of ## / hour or day, at a quality level of Qx. The reliability-related requirements can be developed using operational data and some assumptions. A suggested approach is:

• Define operating environment / duty cycle, i.e.,

• 20 hours/day and 250 days/year or 5000 hours/year

• Expected number of failures < 5/yr (This is an assumption - what can we afford? Can we live with fewer than 5 failures/year?)

• Reliability and Maintainability requirements (based on above data and assumptions)

• MTBF = 5000/5 = 1000 hours; FR = 1/1000 = 0.001 failures/hr

• Estimated repair time or MTTR can be calculated based on the following assumptions

• 3 failures @ < 2 hours = 6 hours

• 1 failure @ < 10 hours = 10 hours

• 1 failure @ < 24 hours = 24 hours

Therefore, the required MTTR = 40/5 = 8 hours

• Reliability and Availability requirements:

• Reliability for 20 hours/day operation

• $R_{20} = e^{-(0.001 \times 20)} = e^{-0.02} \approx 0.98$, or 98%

• Availability = MTBF / (MTBF+MTTR) = 1000/1008 = 99%

• Required (desired) operating costs

• 2 man-hour / hour of operation (currently is 3 man-hour / hour)

• Energy plus other utility cost

• 20% less than current (current usage is 2 MW plus other)

• Maintenance cost (preventive and corrective)

• 2% or less of Replacement Asset Value (currently 2.7% increasing by 0.2% per year)
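The arithmetic in the list above can be reproduced with a short script. This is a minimal sketch that uses the same assumptions as the text: 5 failures per year, the upper-bound repair times, and the exponential reliability model R(t) = e^(-FR x t). The variable names are illustrative.

```python
import math

# Assumptions taken from the example in the text.
hours_per_year = 20 * 250          # 20 hours/day x 250 days/year = 5000 hours
failures_per_year = 5              # assumed acceptable number of failures
repair_hours = [2, 2, 2, 10, 24]   # assumed upper-bound repair time per failure

mtbf = hours_per_year / failures_per_year       # mean time between failures, hours
failure_rate = 1 / mtbf                         # failures per hour
mttr = sum(repair_hours) / len(repair_hours)    # mean time to repair, hours

mission_hours = 20                              # one operating day
reliability_20h = math.exp(-failure_rate * mission_hours)
availability = mtbf / (mtbf + mttr)

print(f"MTBF = {mtbf:.0f} h, FR = {failure_rate:.3f} failures/h, MTTR = {mttr:.0f} h")
print(f"R(20 h) = {reliability_20h:.1%}")
print(f"Availability = {availability:.1%}")
```

Changing `failures_per_year` or the entries in `repair_hours` shows how sensitive the resulting reliability and availability requirements are to the initial assumptions.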

Based on the calculations and data above, we can specify the following requirements for this new system we are procuring.

• MTBF of 1000 hours or FR = 0.001 failures/hr

• MTTR of 8 hours

Alternatively, we can ask for a reliability of 98% for 20 hours of operation per day and an availability of 99%. Similarly, we can specify that total operating and maintenance costs may not exceed some number or percent of system replacement value. However, these numbers should be validated by the system designers/builders by performing an FMEA.

In addition, the requirements and specifications should include:

• Display of asset performance data - such as early warnings

• Current, Temperature, Pressure, etc.

• Other operating / asset condition data

• Diagnostic display - pinpointing problem areas

• Use of modular and standard components

• Use of redundant parts / components to increase reliability

• Minimize special tools - parts

• Operations and maintenance training material

• FMEA / RCM-based maintenance plan

• Maximum use of CBM technologies

• Basis of spares recommendations

• Life Cycle cost analysis

• O&M cost estimates

Reliability Approach in Design

It has been found that as much as 60% of failures and safety issues can be prevented by making changes in design. Assets must be:

• Designed for fault tolerance

• Designed to fail safely

• Designed with early warning of the failure to the user

• Designed with a built-in diagnostic system to identify fault location

• Designed to eliminate all or critical failure modes cost effectively, if possible

The following analyses are recommended to be performed during the design phase - from conceptual design to final design.

• Reliability Analysis

• Lowers asset and system failures over the long term

• System reliability depends on robustness of design, as well as quality and reliability of its components

• Maintainability Analysis

• Minimizes downtime - reduces repair time

• Reduces maintenance costs

• System Safety and Hazard Analysis

• Identifies, eliminates, or reduces safety-related risks throughout the asset's life cycle

• Human Factors Engineering Analysis

• Prevents human-induced errors or mishaps

• Mitigates risks to humans due to interface errors

• Logistics Analysis

• Reduces field support cost resulting from poor quality, reliability, maintainability, and safety

• Ensures availability of all documentation, including PM plan, spares, and training needs

The following checklist is recommended as a guide to review the design and make sure that it adequately addresses reliability, maintainability, and safety issues.

Design Reviews Checklist

• Are reliability, maintainability, availability, and safety analyses performed?

• Is Failure Modes and Effects Analysis (FMEA) performed during the design - at preliminary design reviews (PDR) and critical design reviews (CDR)?

• Can fault-tree analysis be used to improve the design?

• Is fault-tolerant design considered?

• Is component interchangeability analyzed?

• Is modular design considered?

• Are redundancies considered to achieve desired reliability?

• Has the design been critiqued for human errors?

• Are designers familiar with the human engineering guidelines?

• Is Reliability-Centered Maintenance (RCM) considered in design?

• Is a throwaway type of design considered instead of repair (e.g., light bulbs)?

• Has built-in testing and diagnostics been considered?

• Are self-monitoring and self-checking desirable?

• Are components and assets easily accessible for repair?

• Are corrosion-related failures analyzed?

• Do components need corrosion protection?

• Is zero-failure design economically feasible?

• Is damage detection design needed?

• Is software reliability specified and considered in design?

• Is fault-isolation capability needed?

• Do electronic circuits have adequate clearances between them?

• Are software logic concerns independently reviewed?

• Has software coding been thoroughly reviewed?

• Is self-healing design feasible or required?

• Are redundancies considered for software?

• Are the switches for backup devices reliable? Do they need maintenance?

• Are protective devices such as fuses, sprinklers, and relief valves reliable?

• Does the asset need to withstand earthquakes and unusual loads? If yes, are design changes adequate?

• Can manufacturing/fabrication or maintenance personnel introduce any defects? Can they be prevented by design?

• Can the operator introduce wrong inputs - wrong switching or overloads, etc.? If so, can the asset be designed to switch to a fail-safe mode?

• Can a single component cause the failure of a critical function? If yes, can it be redesigned?

• Are there unusual environments not already considered? If hazardous material is being used, how will it be contained or handled safely?

• Is crack growth and damage tolerance analysis required?

• Are safety margins adequate?

• Are inspection provisions made for detecting cracks, damage, and flaws?

• Are production tests planned and reviewed?

• How will reliability be verified and/or validated?
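Several checklist items, such as redundancy and single points of failure, can be explored numerically during a design review. The sketch below compares a series arrangement with a parallel (redundant) arrangement using the standard block-diagram relations. It assumes the components fail independently, and the 90% component reliability is a hypothetical value; the function names are illustrative.

```python
from math import prod


def series_reliability(reliabilities):
    """All components must work: multiply the component reliabilities."""
    return prod(reliabilities)


def parallel_reliability(reliabilities):
    """At least one component must work: one minus the probability
    that every component fails."""
    return 1 - prod(1 - r for r in reliabilities)


component = 0.90  # hypothetical reliability of a single component

two_in_series = series_reliability([component, component])
two_in_parallel = parallel_reliability([component, component])

print(f"Two components in series:   {two_in_series:.2%}")
print(f"Two components in parallel: {two_in_parallel:.2%}")
```

Under these assumptions, two 90% components in series give 81%, while the same two in parallel give 99%. The gain from redundancy has to be weighed against the added cost and the reliability of any switching device, as the checklist notes.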

5 Summary

Improving reliability is essential to the success of any organization, particularly to its operation and maintenance. Understanding reliability and maintenance and how they're interrelated is the basis for reducing the life cycle costs of assets and plant.

Reliability focuses on the ability of an asset to perform its intended function of supporting the manufacture of a product or providing a service.

Reliability terminates with a failure -- i.e., when unreliability occurs.

Unreliability results in high cost to the organization. The high cost of unreliability motivates an engineering solution to control and reduce costs.

Maintenance is an act of maintaining, or the work of keeping the asset in proper operational condition. It may consist of performing maintenance inspection and repair to keep assets operating in a safe manner to produce or provide designed capabilities. Thus, maintenance keeps assets in an acceptable working condition, prevents them from failing, and, if they fail, brings them back to their operational level effectively and as quickly as needed.

Reliability should be designed in. It's a strategic task. In contrast, maintenance keeps the asset functioning and is a tactical task. The reliability and maintainability attributes are usually designed into the product or asset. These attributes minimize maintenance needs by using reliable components, simpler replacements, and easier inspections. Reliability is measured by MTBF, which is the inverse of failure rate. Maintainability, the ease of maintenance, is measured by MTTR. Operations and Maintenance (O&M) costs have been found to be about 80% or more of the total life cycle cost of an asset, so it's important to minimize them. The majority of the O&M costs to be incurred in the future are set during the design and development phase of the asset. Therefore, to reduce the overall life cycle cost, we must adequately address the reliability, maintainability, and safety aspects of the system during the design and building of the asset.

6 QUIZ

__1 Define reliability and maintainability.

__2 What's the difference between maintenance and maintainability?

__3 If an asset is operating at 70% reliability, what do we need to do to get 90% reliability? Assume assets will be required to operate for 100 hours.

__4 If an asset has a failure rate of 0.001 failures/hour, what would be the reliability for 100 hours of operations?

__5 What would be the availability of an asset if its failure rate is 0.0001 failures/hour and average repair time is 10 hours?

__6 What would be the availability of a plant system if it’s up for 100 hours and down for 10 hours?

__7 If an asset's MTBF is 1000 hours and MTTR is 10 hours, what would be its availability and reliability for 100 hours of operations?

__8 Define availability. What strategies can be used to improve it?

__9 What is the impact of O&M cost on the total life cycle cost of an asset?

__10 What approaches could we apply during the design phase of an asset to improve its reliability?
