Maintenance Optimization [part 1]


"Innovative practices combined with true empowerment produce phenomenal results."

  • 1 Introduction
  • 2 Terminology
  • 3 Understanding Failures and Maintenance Strategies
  • 4 Maintenance Strategy - RCM
  • 5 Maintenance Strategy - CBM
  • 6 Other Maintenance Strategies
  • 7 Summary
  • 8 QUIZ

Learning goals:

  • What is a failure?
  • What is RCM?
  • What does it take to implement RCM effectively?
  • What CBM technologies are available?
  • What are the different maintenance strategies?
  • How can you integrate PM and CBM into RCM methodology?
  • When would RTF be a good maintenance strategy?

1. Introduction

Maintenance has entered the heart of many organizational activities due to its vital role in the areas of environment preservation, productivity, quality, system reliability, regulatory compliance, safety, and profitability.

With this new paradigm, new challenges and opportunities are being presented to maintenance and operations professionals. Central to maintenance is a process called Reliability Centered Maintenance, or RCM. RCM helps determine how assets can continue to do what their users require in a given operating context. RCM analysis provides a structured framework for analyzing the functions and potential failures of assets such as airplanes, manufacturing lines, compressors, turbines, and telecommunication systems. RCM was developed in the commercial aviation industry in the late 1960s to optimize maintenance and operations activities. The RCM strategy (some call it a process) can help in developing an effective maintenance plan by selecting appropriate strategies such as PM, CBM, or RTF.

Preventive maintenance (PM) is the planned maintenance of assets designed to improve asset life and avoid unscheduled maintenance activity. PM includes cleaning, adjusting, and lubricating, as well as minor component replacement, to extend the life of assets and facilities.

Condition-based maintenance (CBM) is another maintenance optimizing strategy. CBM attempts to evaluate the condition of assets by performing periodic or continuous condition monitoring. The ultimate goal of CBM is to perform maintenance at a scheduled point in time when the maintenance activity is most cost-effective and before the asset loses optimum performance.

Recent developments in technology have allowed assets to be instrumented so that they provide information about their health. Together with better tools for analyzing condition data, this means today's maintenance personnel are better able to decide the right time to perform maintenance on assets. Ideally, CBM allows maintenance personnel to do the right things - minimizing asset downtime, time spent on maintenance, and spare parts cost. CBM uses real-time data to prioritize and optimize resources.

Although many would not consider it to be a maintenance-optimizing strategy, Run-to-Failure (RTF) can be a viable and economical choice for certain equipment. Selecting RTF should be a deliberate choice because it will lead to unplanned downtime and increased corrective maintenance cost for the specific equipment selected for this maintenance strategy.

However, if the facility and personnel risk is low, RTF may be the most cost-effective strategy for an organization's overall maintenance program.

The key to optimizing your facility's maintenance program is to make the best choice for each piece of equipment as well as for the facility as a whole. The best application of RCM, however, is during design and development of equipment, to eliminate or mitigate the effects of failure modes.

2. Terminology

Age Exploration

An iterative process used to optimize preventive maintenance (PM) intervals.

Condition-Based (or Predictive) Maintenance (CBM / PdM)

Maintenance based on the actual condition (health) of assets obtained from in-place, non-invasive measurements and tests.

Condition-Directed (CD) Tasks

Tasks directly aimed at detecting the onset of a failure or failure symptom.

Corona (Partial Discharge)

The term corona is used as a generic name for any electrical discharges that take place in an energized electrical insulation as the result of accelerated ionization under the influence of the electric field in the insulation. It’s defined as a type of localized discharge resulting from transient gaseous ionization in an insulation system when the voltage stress exceeds a critical value.

Critical Asset

Assets that have been evaluated and classified as critical due to their potential impact on safety, environment, quality, production/operations, and maintenance if failed.


Emissivity

A fundamental property of a material, emissivity is the ratio of the rate of radiant energy emission at a given wavelength from a body with an optically smooth surface, as a consequence of its temperature only, to the corresponding rate of emission from a black body at the same temperature and wavelength.


Failure

Failure is the inability of an asset / component to meet its expected performance.

Failure Cause

The reason something went wrong.

Failure Effect (Consequences)

What happens when a failure mode occurs; its consequences.

Failure-Finding (FF) Tasks

A scheduled task that seeks to determine if a hidden failure has occurred or is about to occur.

Failure Mode

An event that causes a functional failure; the manner of failure.

Failure Mode and Effects Analysis (FMEA)

A technique to examine an asset, process, or design to determine potential ways it can fail and the potential effects, and subsequently identify appropriate mitigation tasks for the highest priority risks.


Ferrography

An analytical method of assessing machine health by quantifying and examining ferrous wear particles suspended in the lubricant or hydraulic fluid.

Functional Failure

A state in which an asset / system is unable to perform a specific function to a level of performance that is acceptable to its user.

Hidden Failure

A failure mode that won’t become evident to a person or the operating crew under normal circumstances.

Operating Context

The environment in which an asset is expected to be used.

P-F Interval

The interval between the point at which a potential failure becomes detectable and the point at which it degrades into a functional failure. It’s also sometimes called lead time to failure.

Potential Failure

A condition that indicates a functional failure is either about to occur or in the process of occurring.


Prognosis

A forecast or prediction of outcome, such as how long an asset or component will last or how much life it has remaining.

Reliability Centered Maintenance (RCM)

A systematic and structured process to develop an efficient and effective maintenance plan for an asset to minimize the probability of failures. The process ensures safety and mission compliance.

Run-to-Failure (RTF)

A maintenance strategy (policy) for assets where the cost and impact of failure is less than the cost of preventive actions. It’s a deliberate decision based on economic effectiveness.

Time-Directed (TD) Tasks

Tasks directly aimed at failure prevention and performed based on time - whether calendar-time or run-time.


Viscosity

Measurement of a fluid's resistance to flow. It’s also often referred to as the structural strength of liquid. Viscosity is critical to oil film control and is a key indicator of condition related to the oil and the machine.

3. Understanding Failures and Maintenance Strategies

FIG. 1 illustrates the period when a failure initiates and eventually becomes a functional failure that leads to a complete asset breakdown.

Assets perform very well in Zone A. However, somewhere in that region, due to a lack of or reduction in lubricant supply, human error, a defect in material, or some other reason, a failure is initiated at the end of Zone A. This defect may take the form of a small crack, debris stuck in the lubricant or in the valve assembly, etc. It continues to grow in Zone B, increasing the asset's failure potential, though still unnoticed. At Point P at the beginning of Zone C, this defect becomes a Potential Failure. Then at Point F, at the end of Zone C, this potential failure creates a functional failure, and a function of the asset stops working.

However, the asset may continue to operate at a reduced capacity or functionality. By now, there will be some visual and/or measurable evidence of a functional failure. Eventually, at Point B, the asset completely shuts down.

FIG. 1 Understanding Failure

The time interval between points P and F is called the P-F interval.

In theory, the interval between PM or on-condition tasks should be shorter than the P-F interval so that potential failures are caught and corrected in time.

However, we don't have good information about where points P or F are in time. Analysis of condition and operating data can help to estimate their (time) location. Our discussion assumes that these points are fixed in time, yet this is not the case in practice. They may vary based on the nature of the defects and the environment. Our goal is to catch any defects before they shut us down.
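The relationship between the task interval and the P-F interval can be sketched in a few lines of Python. This is only an illustration under the simplifying assumption, noted above, that points P and F are fixed in time; the function names and the `repair_lead_days` parameter are hypothetical and not part of any standard.

```python
def warning_time(pf_interval_days: float, task_interval_days: float) -> float:
    """Worst-case time left between detection and functional failure.

    If a check is performed every `task_interval_days`, a defect that reaches
    Point P just after a check is first seen one full task interval later.
    That leaves (P-F interval - task interval) to plan and make the repair.
    Zero or less means the check can miss the potential failure entirely.
    """
    return pf_interval_days - task_interval_days


def interval_is_adequate(pf_interval_days: float,
                         task_interval_days: float,
                         repair_lead_days: float = 0.0) -> bool:
    """True when the worst-case warning time exceeds the time needed to react.

    `repair_lead_days` is an assumed planning allowance (parts, scheduling).
    """
    return warning_time(pf_interval_days, task_interval_days) > repair_lead_days


# Hypothetical numbers: a 60-day P-F interval checked every 30 days
print(warning_time(60, 30))              # worst-case warning, in days
print(interval_is_adequate(60, 30, 14))  # enough time for a 14-day lead?
print(interval_is_adequate(60, 60))      # task interval equal to P-F interval
```

Because P and F vary with the defect and the environment, any estimate fed into a check like this should be treated as approximate.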

The best strategy is to find a defect or any abnormal condition in Zone B as soon as possible, utilizing condition-based tasks. CBM and/or PM can be used to identify the sources of these defects and correct them in their early stages.

Traditional thinking has been that the goal of preventive maintenance (PM) is to preserve assets. On the surface, it makes sense, but the problem is in that mindset. In fact, that thinking has been proven to be flawed at its core. The blind quest to preserve assets has produced many problems, such as being overly conservative with any maintenance actions that could cause damage due to intrusive actions, thereby increasing the chances of human error. Other flaws include both thinking that all failures are equal and performing maintenance simply because there is an opportunity to do so.

In the last few decades many initiatives have been developed in cost reduction, resource optimization, and bottom line focus of any action we take. The mentality of preserving assets quickly consumed resources, put maintenance plans behind schedule, and overwhelmed the most experienced maintenance personnel. Worse, this mentality sometimes caused maintenance actions to become totally reactive.

The development of a Reliability-Centered Maintenance approach has provided a fresh perspective in which the purpose of maintenance is not to preserve assets for the sake of the assets themselves, but rather to preserve asset functions. At first, this might be a difficult concept to accept because it’s contrary to our ingrained mindset that the sole purpose of preventive maintenance is preserving equipment operation. But in fact, in order to develop an effective maintenance strategy, we need to know what the expected output is and the functions that the asset supports - that is, the real purpose of having the asset.

4. Maintenance Strategy - RCM

Reliability-Centered Maintenance, often known as RCM, is a process to ensure that assets continue to do what their users require in their present operating context. It’s a structured process to develop an efficient and effective maintenance plan for an asset to minimize the probability of failures.

RCM is generally used to achieve improvements in all aspects of asset management, such as the establishment of safe minimum or optimum levels of maintenance, changes to operating procedures, and establishment of an effective maintenance plan. Successful implementation of RCM will promote cost effectiveness, asset uptime, and a better understanding of the level of risk that the organization is presently managing.

It has been demonstrated that the best application of RCM is during design and development phases of the assets to eliminate or mitigate effects of failure modes.

RCM History and Development

Reliability Centered Maintenance (RCM) is a systematic approach for developing new maintenance requirements where one does not exist, and optimizing an existing maintenance program. In both cases, the end result of the RCM analysis is a maintenance program composed of tasks that represent the most technically correct and cost-effective approach to maintaining asset/component operability. This operability in turn lends itself to improved system reliability and plant availability. Another important result of an RCM program is a documented technical basis for every maintenance program decision.

In the late 1960s, as Boeing's 747 jumbo jet was becoming a reality, all owners/operators of the aircraft were required to provide a PM program to the FAA for approval in order to get certified for operation. No aircraft can be sold without this type of certification. The recognized size of the 747, carrying about three times as many passengers as the 707 or DC-8, and its many technological advances in structure and avionics led the FAA to take the position at first that the preventive maintenance on the 747 would be very extensive. In fact, airlines thought that they might not be able to operate this aircraft profitably with that requirement.

This development essentially led the commercial aircraft industry to undertake a complete re-evaluation of its preventive maintenance strategy. Bill Mentzer, Tom Matteson, Stan Nowlan, and Howard Heap of United Airlines led the effort. What resulted was an entirely new approach that employed a decision-tree process for ranking PM tasks that were necessary to preserve critical aircraft functions during flights. This new technique was defined and explained in Maintenance Steering Group 1 (MSG-1) for the 747 and was subsequently approved by the FAA. With the success of MSG-1, its principles were applied to other aircraft such as the DC-10, MD-80/90, and Boeing 757 / 777; to Navy P-3 and S-3 aircraft; and to Air Force F-4J aircraft under a contract with the U.S. Department of Defense (DOD). In 1975, DOD directed that the MSG concept be labeled Reliability-Centered Maintenance (RCM) and be applied to all major military systems. In 1978, United Airlines produced the initial RCM "bible" under DOD contract.

RCM development has been an evolutionary process. Over 40 years have passed since its inception during which RCM has become a mature process. However, industry has yet to fully embrace the RCM methodology in spite of its proven track record. In recent years, Anthony (Mac) Smith and Jack Nicholas have been leaders in creating increased RCM awareness. Examples discussed in this section are the result of work performed by Mac Smith, Glen Hinchcliffe and the author in optimizing PMs utilizing RCM methodology.

The RCM Principles

There are four principles that define and characterize RCM, and set it apart from any other PM planning process.

Principle 1: The primary objective of RCM is to preserve system function.

This principle is one of the most important and perhaps the most difficult to accept because it’s contrary to our ingrained notion that PMs are performed to preserve equipment operation. By addressing system function, we want to know what the expected output should be, and also understand that preserving that output (function) is our primary task at hand.

Principle 2: Identify failure modes that can defeat the functions.

Because the primary objective is to preserve system function, the loss of function is the next item of consideration. Functional failures come in many sizes and shapes; they are not always as simple as, "we have it or we don't." For example, the loss of fluid boundary integrity in a pumping system illustrates this point. A system loss of fluid can be 1) a very minor leak that may be qualitatively defined as a drip; 2) a fluid loss that can be defined as a design basis leak - that is, any loss beyond a certain flow value will produce a negative effect on system function, but not necessarily total loss; or 3) a total loss of boundary integrity, which can be defined as a catastrophic loss of fluid and loss of function. In this example, a single function - preserve fluid integrity - led to three functional failures.

The key point of Principle 2 is to identify the specific failure modes in a specific component that can potentially produce those unwanted functional failures.

Principle 3: Prioritize function needs (failure modes).

Not all functions are equally important. A systematic approach is taken to prioritize all functional failures and failure modes using a priority assignment rationale.

Principle 4: Select applicable and effective tasks.

Each potential PM or CBM task must be judged as being applicable and effective. Applicable means that if the task is performed, it will accomplish one of the three reasons for doing PM or CBM:

1. Prevent or mitigate failure.

2. Detect onset of a failure.

3. Discover a hidden failure.

Effective means that we are sure that this task will be useful and we are willing to spend resources to do it. In addition, RCM recognizes the following:

Design Limitations. The objective of RCM is to maintain the inherent reliability of system function. A maintenance program can only maintain the level of reliability inherent in the system design; no amount of maintenance can overcome poor design. This makes it imperative that maintenance knowledge be fed back to designers to improve the next design. RCM recognizes that there is a difference between perceived design life (what the designer thinks the life of the system is) and actual design life. RCM explores this through the Age Exploration (AE) process.

RCM Is Driven by Safety First, then Economics. Safety must be maintained at any cost; it always comes first in any maintenance task. Hence, the cost of maintaining safe working conditions is not calculated as a cost of RCM. Once safety on the job is ensured, RCM assigns costs to all other activities.

Elements of RCM

The SAE JA1011 standard describes the minimum criteria to which a process must comply to be called RCM. An RCM Process answers the following seven essential questions:

1. What are the functions and associated desired standards of performance of the asset in its present operating context (functions)?

2. In what ways can the asset fail to fulfill its functions (functional failures)?

3. What causes each functional failure (failure modes)?

4. What happens when each failure occurs (failure effects)?

5. In what way does each failure matter (failure consequences)?

6. What should be done to predict or prevent each failure (proactive tasks and task intervals)?

7. What should be done if a suitable proactive task cannot be found (default actions)?

Unlike some other maintenance planning approaches, RCM results in all of the following tangible actionable options:

• Maintenance task schedules, which can include:

    • Time Directed (TD) tasks (calendar / run-time based PMs)

    • Condition Directed (CD) tasks (CBM/PdM tasks)

    • Failure Finding (FF) tasks (operator supported tasks)

    • Run-to-Failure (RTF) tasks (economic decision based)

• Revised operating procedures for the operators of the assets, which might include service-type tasks such as changing filters, taking oil samples, and recording operating parameters

• A list of recommended changes to the design of the asset that would be needed if a desired performance is to be achieved

RCM shifts the emphasis of maintenance from the idea that all failures are bad and must be prevented, to a broad understanding of the purpose of maintenance. It seeks the most effective strategy that focuses on the performance of the organization. It might include not doing something about a failure or letting failures happen. The RCM approach encourages us to think of more encompassing ways of managing failures.

RCM Analysis Process

Although RCM has a great deal of variation in its application, most procedures include some or all of the following nine steps:

1. System selection and information collection

2. System boundary definition

3. System description and functional block diagram

4. System functions and functional failures

5. Failure mode and effects analysis (FMEA)

6. Logic (decision) tree analysis (LTA)

7. Selection of maintenance tasks

8. Task packaging and implementation

9. Making the program a living one - continuous improvements

Step 1: System Selection and Information Collection

The purpose of Step 1 is to ensure that the RCM team has evaluated its area well enough to know which systems are the problems, the so-called bad actors. The team can use Pareto analysis (80/20 rule) to determine the list of problems, using criteria of highest total maintenance costs (CM+PM), downtime hours, and number of corrective actions. Identifying these systems defines the dimensions of the RCM effort that will provide the greatest Return-on-Investment. FIG. 2 lists a plant's asset data - failure frequency, downtime, and maintenance costs - to help us decide which assets are the right candidates for RCM analysis. The first few assets listed in FIG. 2 are good candidates for RCM efforts.

FIG. 2 Plant Failure and Cost Data by Assets
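The Pareto screening described for Step 1 can be sketched as follows. This is a minimal illustration that ranks on total maintenance cost (CM+PM) only; downtime hours or the number of corrective actions could be ranked the same way. The asset names and cost figures are invented for the example and are not the data in FIG. 2.

```python
from typing import List, NamedTuple


class AssetRecord(NamedTuple):
    name: str
    cm_cost: float  # corrective maintenance cost over the review period
    pm_cost: float  # preventive maintenance cost over the review period


def pareto_candidates(records: List[AssetRecord], share: float = 0.80) -> List[str]:
    """Return the highest-cost assets that together reach `share` of total cost."""
    ranked = sorted(records, key=lambda r: r.cm_cost + r.pm_cost, reverse=True)
    total = sum(r.cm_cost + r.pm_cost for r in ranked)
    selected, running = [], 0.0
    for record in ranked:
        if total == 0 or running >= share * total:
            break  # the target share of cost is already covered
        selected.append(record.name)
        running += record.cm_cost + record.pm_cost
    return selected


# Hypothetical plant data, for illustration only
plant = [
    AssetRecord("Conveyor", 7_000, 2_000),
    AssetRecord("Air compressor", 45_000, 10_000),
    AssetRecord("Lighting panel", 1_500, 500),
    AssetRecord("Cooling water pump", 25_000, 5_000),
    AssetRecord("HVAC unit", 3_000, 1_000),
]

print(pareto_candidates(plant))
```

With these invented numbers, two of the five assets account for more than 80% of total maintenance cost, which is the kind of concentration the 80/20 rule looks for.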

FIG. 3 System Boundary Definition

Selection of RCM team members is a key element in executing a successful RCM program. The team should include the following:

• System operator (craft)

• System maintainer (craft - mechanical / electrical / controls)

• Operations / Production engineer

• Systems / Maintenance engineer (mechanical / electrical)

• CBM/PdM specialist or technician

• Facilitator

The use of facilitators is recommended to support RCM efforts. They ensure that the RCM analysis is carried out at the proper level, no important items are overlooked, and the results of the analysis are properly recorded. Facilitators also manage issues among the team members, helping them reach consensus in an orderly fashion, retaining the members' commitment, and keeping them engaged.

Another objective of Step 1 is to collect information that will be required by the team as they perform the system analysis. This information includes schematics, piping and instrumentation diagrams (P&ID), vendor manuals, specification and system descriptions, operating instructions, and maintenance history.

Step 2: System Boundary Definition

After a system has been selected, the next step is to define its boundaries - understanding the system as a whole and its functional sub-systems. This step assures that there are no overlaps or gaps between adjacent systems. We need to have a clear record for future reference on exactly what was defined within the system. In addition, we must specify the boundaries in precise terms; a key portion of analysis depends on defining exactly what is crossing the boundaries, both "incoming - IN" and "outgoing - OUT" interfaces respectively.

An example of a system boundary is shown in FIG. 3.

Step 3: System Description and Functional Block Diagram

Step 3 requires identifying and documenting the essential details of the system.

It includes the following information:

  • System description
  • Functional block diagram
  • IN / OUT interfaces
  • System work breakdown structure
  • Equipment / component history

A well-documented system description will record an accurate baseline definition of the system as it existed at the time of the RCM analysis. Various design and operational changes can occur over time. Therefore, the system must be baselined to identify where PM task revision might be required in the future.

Frequently it has been found that team members and analysts may have only a superficial knowledge of the system. Recording a detailed system functional description narrative will ensure that the team has a comprehensive review of the system.

In addition, documenting the following information can be helpful in analyzing the data later on:

• Redundancy Features: Equipment / component redundancy, alternate mode operations, design margins, and operator workaround capabilities

• Protection Features: A list of devices that are intended to prevent personnel injury or secondary system damages when an unexpected component failure occurs; it may include items such as inhibit or permissive signals, alarms, logic, and isolation

• Key Control Features: An overview of how the system is controlled; also briefly highlighting features such as automatic vs. manual, central vs. local, and various combinations of the above as they may apply

FIG. 4 Functional Block Diagram

The next item in Step 3 is to develop a Functional Block Diagram (FBD), which is a top-level representation of the major interfaces between a selected system and adjacent systems. FIG. 4 illustrates an FBD with functional interfaces including sub-systems.

In an actual team setting, it’s desirable to have a discussion first regarding various possibilities that should be considered in creating functional sub-systems. When the FBD is finalized, it will show a decision on the use of functional subsystems as well as the final representation of the IN / OUT interfaces.

Listing all components as part of the System Work Breakdown Structure (SWBS) is very desirable. The SWBS is the compilation of the line items list for the system; it is a system hierarchy listing parent-child relationships. In most cases, the SWBS should be what's in the CMMS for the system being analyzed. In older plants, where the reference sources could be out of date, the RCM team should perform a system walk down to ensure accuracy in the final SWBS. This practice is a good one, even if the system is well documented, to help the team familiarize itself with the system.

The last item in Step 3 is to collect historical system data. It will be beneficial for the analysis team to have a history of the past 2-5 years of component and system failure events. This data should come from corrective maintenance reports or from the CMMS system. Unfortunately, it’s not uncommon to find a scarcity of useful failure event information. In many plants, the history kept is of very poor quality. Most of the time, the repair history will simply state "Repaired pump" or "Fixed pump." Improving data quality is a challenge for many organizations. If a good failure history is not available, the team can work together to develop a list of failure events over the last few years. This list of failures would help in performing the FMEA analysis in Step 5.

FIG. 5 Functions / Functional Failures

Step 4: System Functions and Functional Failures

Because the ultimate goal of RCM is "to preserve system function," it’s incumbent upon the RCM team to define a complete list of system functions and functional failures. Therefore, in Step 4, system functions and functional failures are documented.

The function statement should describe what the system does - its functions. For example, a correct function might be "Maintain a flow of 1000 GPM at header 25," but not "provide a 1000 GPM centrifugal pump for discharge at header 25." Another example would be "maintain lube oil temperature 110°F." In theory, we should be able to stand outside of a selected system boundary, with no knowledge of the SWBS for the system, and define the functions simply by what is leaving the system (OUT interfaces).

The next step is to specify how much of each function can be lost, i.e., functional failures. Most functions have more than one loss condition if we have done a good job with the system description. For example, the loss condition can range from total loss, through varying levels of partial loss that have different levels of plant consequences (and thus priority), to failure to start on demand, etc. The ultimate objective of an RCM analysis is to prevent these functional failures and thereby preserve function. In Step 7, this objective will lead to the selection of preventive maintenance tasks that will successfully avoid the really serious functional failures.

FIG. 6 Example of FMEA

Step 5: Failure Mode and Effects Analysis (FMEA)

Step 5, Failure Mode and Effects Analysis, is the heart of the RCM process. FMEA has been used traditionally to improve system design and is now being used effectively for failure analysis that is critical to preserve system function.

By developing the functional failure - equipment matrix, Step 5 considers for the first time the connection between function and hardware.

This matrix lists functional failures from Step 4 as the horizontal elements and the SWBS from Step 3 as the vertical elements. The team's job at this point is to ascertain from experience whether each intersection between a component and a functional failure contains the makings of some malfunction that could lead to that functional failure. The team completes the matrix by considering each component's status against all functional failures, moving vertically down the component list one at a time. After the entire matrix has been completed, it will produce a pattern of Xs that essentially constitutes a road map to guide a more detailed analysis.
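One simple way to hold such a matrix is a nested mapping from component to functional failure. The sketch below is illustrative only: the component names, the functional failure labels, and the set of team judgments (`contributes`) are hypothetical, and a real analysis would take them from the SWBS in Step 3 and the list produced in Step 4.

```python
def build_matrix(components, functional_failures, contributes):
    """Map each component to {functional failure: True/False}.

    `contributes` is the set of (component, functional failure) pairs the team
    judged could lead to that functional failure (the Xs in the matrix).
    """
    return {
        component: {ff: (component, ff) in contributes for ff in functional_failures}
        for component in components
    }


def components_for_fmea(matrix):
    """Components with at least one X, i.e. those carried into the FMEA."""
    return [component for component, row in matrix.items() if any(row.values())]


# Hypothetical pumping system, for illustration only
components = ["Pump", "Motor", "Pressure gauge"]
functional_failures = ["No flow at header", "Flow below 1000 GPM"]
contributes = {
    ("Pump", "No flow at header"),
    ("Pump", "Flow below 1000 GPM"),
    ("Motor", "No flow at header"),
}

matrix = build_matrix(components, functional_failures, contributes)
for component, row in matrix.items():
    marks = ["X" if hit else "-" for hit in row.values()]
    print(component.ljust(16), " ".join(marks))

print(components_for_fmea(matrix))
```

Rows without any X (here the pressure gauge) drop out of the detailed analysis, which is how the pattern of Xs acts as a road map.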

The next step in this process is to perform the FMEA (which is discussed in Section 11), considering each component and functional failures as shown in FIG. 6. FMEA addresses the second RCM principle, to "determine the specific component failures that could lead to one or more of the functional failures." These are the failures which defeat functions and become the focus of the team's attention.

In reviewing failure modes, teams can use the following guidelines in accepting, rejecting, or putting aside for later considerations:

• Probable Failure Mode. Could this failure mode occur at least once in the life of the equipment / plant? If yes, it’s retained. If no, it’s considered a rare event and is dropped out from further consideration.

• Implausible Failure Mode. Does this failure mode defy the natural laws of physics - is it one that just could not ever happen? There are usually few, if any, hypothesized failure modes in this category. But if one arises, label it as "Implausible" and drop it from further consideration.

• Maintainable Failure Mode. Certain failure modes clearly can pass the above two tests, but in the practical sense would never be a condition where a preventive action would be feasible. Doing preventive maintenance on a printed circuit board full of IC chips is one example where the practical maintenance approach is to replace it when (and if) it fails.

• Human Error Causes. If the only way this failure mode could happen is the result of an unfortunate (but likely) human error, we note as such for the record. But we drop it from further consideration because we really cannot schedule a preventive action to preclude such random and uncontrollable occurrences. If the hypothesized human error problem is important, we could consider this condition later in the evaluation of other forms of corrective or mitigating action such as redesign through control logic.

For each failure mode retained for analysis, the team then decides on its one or two most likely failure causes. A failure cause is, by definition, a 1-3 word description of why the failure occurred. We limit our judgments to root causes. If the failure mode can occur only due to another previous failure somewhere in the system or plant, then this is considered a consequential cause.

Each failure mode retained is now evaluated as to its local effect.

What can it do to the component; what can it do to the system functions; how can it impact the system / plant output? If safety issues are raised, they too can become part of the recorded effect. In the failure effects analysis, assume a single failure scenario. Also allow all facets of redundancy to be employed in arriving at statements of failure effect. Thus, many single failure modes can have no effect at the plant or system level, in which case, designate the failure mode as low priority and don’t pass them to Step 6, Logic Tree Analysis. If there is either a system or plant effect, the failure mode is passed on to Step 6 for further priority evaluation. Those failure modes considered as low priority here are assigned as candidates for run-to-failure (RTF) and are given a second review in Step 7 for final RTF decision.

RTF does not imply that this component or asset is unimportant; instead, components that are designated as RTF have no significant consequence as the result of a failure. It does not matter whether failed components are restored immediately, as long as they are repaired to an operable status in a timely manner.
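The screening rule at the end of the failure effects analysis can be sketched in a few lines of Python. This is a minimal illustration, not part of the RCM process definition: the `FailureMode` class, its field names, and the two example failure modes are hypothetical, and the effect flags are assumed to have been judged under a single-failure scenario with redundancy credited.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One failure mode from Step 5 (field names are illustrative assumptions)."""
    name: str
    system_effect: bool  # affects a system function, with redundancy credited
    plant_effect: bool   # affects plant output, with redundancy credited


def screen_failure_mode(fm: FailureMode) -> str:
    """Route a failure mode after the failure effects analysis."""
    if fm.system_effect or fm.plant_effect:
        # Any system or plant effect sends the mode on for priority evaluation.
        return "pass to Step 6 (LTA)"
    # No effect beyond the component: low priority, reviewed again in Step 7.
    return "RTF candidate (second review in Step 7)"


# Hypothetical examples
print(screen_failure_mode(FailureMode("bearing seizure", True, True)))
print(screen_failure_mode(FailureMode("indicator lamp burnout", False, False)))
```

Keeping the result as an RTF *candidate* rather than a final decision reflects the second review that takes place in Step 7.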

FIG. 7 Logic Tree Analysis Structure

Step 6: Logic (Decision) Tree Analysis (LTA)

In Step 6, because not all failures are equally important, we need to screen our information further to focus on what really counts.

The Logic Tree shown in FIG. 7 poses three simple questions that require either a Yes or No answer. The result is that each failure mode will ultimately be assigned an importance designator that will constitute a natural ordering of the priority that we should address in allocating our resources. The following coding can be used to label the failure modes:

• A - Top priority item

• B - Second and next significant item

• C - A low priority item that is likely a non-PM item

• D - RTF item

The three questions are:

1. Under normal conditions, do the operators know that something has occurred? Question 1 looks at operator knowledge that something is not normal, given the occurrence of the failure mode. It’s not necessary that operators know exactly what failure mode has occurred, although they may pinpoint the exact failure instantly. If they sense an abnormality, they will look to find out what is wrong. Such a failure mode is an evident failure mode. If the operators have no clue whatsoever as to the occurrence of the abnormality, the failure mode is hidden, and receives the label "D" at this point.

2. Does this failure mode cause a safety problem? The failure mode is then carried to Question 2 regarding possible safety or environment issues. An answer of Yes to Question 2 picks up an A label on the failure mode, a rating which raises the failure mode to the highest importance level in the LTA.

3. Does this failure mode result in a full or partial outage of the plant? Finally, Question 3 inquires as to whether the failure mode could lead to a plant outage (or production downtime). An answer of Yes here results in a B label; No, by default, results in a C label. A C designates the failure mode as one of little functional significance.

Thus, every failure mode passed to Step 6 receives one of the following labels or categories: A, B, C, D, or any combination of these. Any failure mode that contains an A in its label is a top priority item; a B is the second and next significant priority item; a C is essentially a low priority item that, in the very practical sense, is probably a non-issue in allocating preventive maintenance resources. All C and D/C failure modes are good candidates for RTF. Primary attention will be placed on the A and B labels, which are addressed in Step 7.
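The three questions can be expressed as a small function. The sketch below follows the labeling rules as described in this step; the function and parameter names are assumptions made for illustration, and the combined labels are written with a slash (for example "D/C") to match the notation used above.

```python
def lta_label(evident: bool, safety_issue: bool, causes_outage: bool) -> str:
    """Assign an LTA label from the answers to Questions 1-3."""
    # Questions 2 and 3: safety first, then full or partial outage.
    if safety_issue:
        base = "A"
    elif causes_outage:
        base = "B"
    else:
        base = "C"
    # Question 1: a hidden failure mode carries a "D" in front of its label.
    return base if evident else f"D/{base}"


def is_rtf_candidate(label: str) -> bool:
    """C and D/C failure modes are the candidates for run-to-failure."""
    return label in {"C", "D/C"}


for answers in [(True, True, False), (True, False, True), (False, False, False)]:
    label = lta_label(*answers)
    print(answers, "->", label, "| RTF candidate:", is_rtf_candidate(label))
```

Because the safety question is asked before the outage question, a failure mode that raises both concerns is labeled A, not B.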

Step 7: Selection of Maintenance Tasks

The fourth RCM principle, "select applicable and effective maintenance tasks for the high priority failure modes," is addressed in this step.

In Step 7, all team knowledge is applied to determine the most applicable and cost effective tasks that will eliminate, mitigate, or warn us of the failure modes and causes that we assigned to each component or piece of equipment in Step 5. The team revisits those failure modes they initially believed did not impact the functioning of the system, and re-evaluates them. Finally, the team compares the new PM program to the old one, seeing where the program has been improved and optimized. Step 7 comprises three steps that are discussed in this section:

• Task selection

• Sanity check

• Task comparison

In task selection, the following questions are addressed:

• Is the age reliability relationship for this failure known? If yes,

• Are there any applicable Time directed (TD) tasks? If yes,

• Specify those tasks.

• Are there any Condition Directed (CD) tasks? If yes,

• Specify those tasks.

• Is there a "D" category of failure modes? If yes,

• Are there any applicable Failure Finding (FF) Tasks? If yes,

• Specify those tasks.

• Can any of these tasks be ineffective? If no,

• Finalize tasks above as TD, CD, and FF tasks.

• If any tasks may not be effective, then can design modifications eliminate failure mode or effect? If yes,

• Request design modifications.
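The question sequence above can be sketched as a function that collects candidate tasks for one failure mode. This is only an illustration under stated assumptions: the class, field, and task names are hypothetical, and the text does not say what happens to the remaining tasks when some are judged possibly ineffective, so the sketch simply keeps the tasks the team considers effective.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FailureModeReview:
    """Team answers for one failure mode (all field names are hypothetical)."""
    age_reliability_known: bool
    td_tasks: List[str] = field(default_factory=list)  # applicable time-directed tasks
    cd_tasks: List[str] = field(default_factory=list)  # applicable condition-directed tasks
    hidden: bool = False                                # LTA label contains "D"
    ff_tasks: List[str] = field(default_factory=list)  # applicable failure-finding tasks
    doubtful: List[str] = field(default_factory=list)  # tasks that may not be effective
    design_mod_feasible: bool = False                   # could redesign remove the mode or effect?


def select_tasks(review: FailureModeReview) -> Dict[str, object]:
    """Walk the task selection questions in the order listed above."""
    candidates = []
    if review.age_reliability_known:
        candidates += [("TD", t) for t in review.td_tasks]
    candidates += [("CD", t) for t in review.cd_tasks]
    if review.hidden:
        candidates += [("FF", t) for t in review.ff_tasks]

    questioned = [(kind, t) for kind, t in candidates if t in review.doubtful]
    # Assumption: tasks not in doubt are finalized even if others are questioned.
    finalized = [(kind, t) for kind, t in candidates if t not in review.doubtful]
    return {
        "finalized": finalized,
        "questioned": questioned,
        "request_design_modification": bool(questioned) and review.design_mod_feasible,
    }


review = FailureModeReview(
    age_reliability_known=True,
    td_tasks=["replace seal"],
    cd_tasks=["vibration check"],
    doubtful=["replace seal"],
    design_mod_feasible=True,
)
print(select_tasks(review))
```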

Task Selection is a key item to be discussed in this step. It’s very important that the team members put their past biases aside at this juncture and develop a creative and free-flowing thought process to put forth the best possible ideas for candidate PM tasks - even if some of their suggestions may sound a bit off-the-wall at first blush. It might also be useful to get the help of predictive maintenance specialists if they are not part of the team.

A final aspect of the task selection process is to revisit the failure modes that are designated as RTF candidates. This is part of the sanity check, to ensure all task selections are appropriate. We need to examine other non-function related consequences that could cause us to reverse the RTF decision for reasons such as high cost, regulatory difficulties and violations, the likelihood of secondary failure damage, warranty and insurance factors, or hidden failure conditions. The team can elect to drop the RTF decision in favor of a PM task if they believe that the potential consequences of the failure mode are severe.

The last item in the RCM system analysis process before proceeding to PM task implementation is Task Comparison. Now the team lays out what they have recommended for an RCM-based program versus the current PM program. This is the first time in the entire process that the team will deliberately examine the current PM task structure in detail.

The difficulty in performing task comparison stems from the fact that the RCM-based PM tasks were developed at the failure mode level of analysis detail, whereas the current PM tasks were identified at the component level. Hence, analysts must use their experience and judgment to map the current PM tasks onto failure modes so that the two programs can be compared at the same level of detail. This can be somewhat difficult at times and may require careful review.


PM Work Instruction Example

1. Work Instruction Title: Compressor #xx - area XY

2. Task Instructions #: PM XXXXX

3. Task Interval: 6 Month / 1000 hours of operations

4. Priority: Low, medium, high, or X, based on organization's priority scheme

5. Estimated Hours with Skill: Mech. - 20, Elect - 10, Total = 30

6. Actual Hours: Mech. - 18, Elect - 10, Labor - 4, Total = 34 (Inputted after the job has been completed)

7. Component Name and ID #

8. Contact #: Planner #; Systems Engineer #

9. Special safety instructions: Lock out / Tag Out details and clearance permits

10. Special Tools requirements

11. Material Handling support requirements

12. Task Objectives

13. Task detailed steps

14. Spare / parts required

15. As found condition list

16. Work performed

17. Post maintenance test measurement data

18. Other observations

Notes: Could the effectiveness of this PM be improved? No / Yes, and how?

FIG. 8 Sample of PM Work Instruction


Step 8: Task Packaging and Implementation

Step 8, task packaging and implementation, is a crucial step for realizing the benefits of RCM analysis. This step is usually very difficult to accomplish successfully. In fact, the majority of RCM failures happen during this step, with the analysis results put aside on the shelf. However, if team members have been selected from all critical areas and they have been participating diligently, implementation will go smoothly and will be successful.

The final implementation action is to write task procedures that translate the analysis results into actionable instructions for the operations and maintenance teams, including CBM/PdM technicians. If the work is multi-disciplined (multi-craft skills), it may require writing separate instructions for each craft group, depending upon union contract requirements. In that case, the coordination between the crafts should be part of each instruction. Even so, it’s always beneficial and effective to have multi-skill crews handle multi-disciplined work. In most cases, these instructions will be kept in the CMMS and will be issued per established schedule or based on CBM data. FIG. 8 shows a list of items that can be part of good PM work instructions.
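As an illustration of how a work instruction might be held as a structured record, the sketch below models a few of the FIG. 8 items in Python. It does not represent any particular CMMS schema: the class name, field names, and the actual-hours figures in the example are hypothetical, while the title, interval, and estimated hours reuse the sample values from FIG. 8.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PMWorkInstruction:
    """A subset of the FIG. 8 items (field names are illustrative assumptions)."""
    title: str
    instruction_id: str
    interval: str
    priority: str
    estimated_hours: Dict[str, float]                  # hours by craft/skill
    safety_instructions: List[str] = field(default_factory=list)
    actual_hours: Dict[str, float] = field(default_factory=dict)  # entered after the job

    def total_estimated(self) -> float:
        return sum(self.estimated_hours.values())

    def total_actual(self) -> float:
        return sum(self.actual_hours.values())

    def hours_variance(self) -> float:
        """Actual minus estimated hours; positive means the job ran over."""
        return self.total_actual() - self.total_estimated()


wi = PMWorkInstruction(
    title="Compressor #xx - area XY",
    instruction_id="PM XXXXX",
    interval="6 months / 1000 hours of operation",
    priority="medium",
    estimated_hours={"Mech.": 20, "Elect.": 10},
    safety_instructions=["Lock out / Tag out details", "Clearance permits"],
)
wi.actual_hours = {"Mech.": 22, "Elect.": 9}  # hypothetical values entered after completion
print(wi.total_estimated(), wi.total_actual(), wi.hours_variance())
```

Tracking the variance between estimated and actual hours is one simple way to feed the "Could the effectiveness of this PM be improved?" note back into planning.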

Step 9: Making the Program a Living One - Continuous Improvements

RCM execution is not a one-time event. It's a journey.

RCM is a paradigm shift in how maintenance is perceived and executed.

An RCM-based maintenance program needs to be reviewed and updated on a continuous basis. A living RCM program consists of:

• Validation of existing program - maintenance decisions made are appropriate

• Reviewing current failure history and evaluating maintenance tasks and their effectiveness

• Making adjustments in maintenance program if needed

A living RCM program assures continual improvement and cost effective operation and maintenance in the organization. We also need to establish some effective metrics to know where the program stands.

Other RCM Processes

There are many derivatives of RCM such as RCM++, RCM cost, RCM turbo, RCM backfit, RCM streamline, VRCM, Abbreviated, and Experience-Based. All of these derivatives aim to perform RCM cost effectively. Most of them take some shortcuts: cutting some steps, considering only a limited number of failure modes, or automating the process using software to reduce the time taken to complete the analysis. In addition, RCM software programs are also available from JMS software, Isograph, ReliaSoft, Relex, and others. These programs can help to reduce the time taken to perform RCM analyses.

RCM Benefits

• Reliability. The primary goal of RCM is to improve asset reliability and availability cost-effectively. This improvement comes through constant reappraisal of the existing maintenance program and improved communication between maintenance supervisors and managers, operations personnel, maintenance mechanics, planners, designers, and equipment manufacturers. This improved communication creates a feedback loop from the maintenance craft in the field all the way to the equipment manufacturers.

• Cost. Due to the initial investment required to obtain the technological tools, training, and equipment condition baselines, a new RCM program typically results in a short-term increase in maintenance costs. The increase is relatively short-lived. The cost of reactive maintenance decreases as failures are prevented and preventive maintenance tasks are replaced by condition monitoring. The net effect is a reduction of reactive maintenance and a reduction in total maintenance costs.

• Documentation. One of the key benefits of an RCM analysis is the understanding and documentation of key operations and maintenance features, failure modes, the basis of PM tasks, related drawings and manuals, etc. This documentation can be good training material for new O&M personnel.

• Equipment/Parts Replacement. Another benefit of RCM is that it obtains the maximum use from the equipment or system. With RCM, equipment replacement is based on equipment condition, not on the calendar. This condition-based approach to maintenance extends the life of the facility and its equipment.

• Efficiency/Productivity. Safety is the primary concern of RCM. The second most important concern is cost effectiveness, which takes into consideration the priority or mission criticality and then matches a level of cost appropriate to that priority. The flexibility of the RCM approach to maintenance ensures that the proper type of maintenance is performed when it’s needed. Maintenance that is not cost effective is identified and not performed.

In summary, the multi-faceted RCM approach promotes the most efficient use of resources. The equipment is maintained as required by its characteristics and the consequences of its failures.

Impact of RCM on a Facility's Life Cycle

RCM must be a consideration throughout the life cycle of a facility if it’s to achieve maximum effectiveness. The four major phases of a facility's life cycle are:

1. Planning (Concept)

2. Design and Build

3. Operations and Maintenance

4. Disposal

It has been documented in many studies that about 80% or more of a facility's life cycle cost is fixed during the planning, design and build phases. The subsequent phases fix the remaining 20% or so of the life cycle cost. Thus, the decision to institute RCM at a facility, including condition monitoring, will have a major impact on the life-cycle cost of the facility. This decision is best made during the planning and design phases. As RCM decisions are made later in the life cycle, it becomes more difficult to achieve the maximum possible benefit from the RCM program.

Although this has a relatively small impact on the overall life-cycle cost, a balanced RCM program is still capable of achieving savings of 10-30% in a facility's annual maintenance budget during the O&M phase.

