Availability Management
Availability Management is required to optimize the availability of the IT Infrastructure and of the supporting organizations that provide IT services. Service availability depends on the complexity of IT systems and the reliability of IT infrastructure components and environments. The Availability Management function is usually only noticed on the occasions when a service becomes unavailable, and is often measured according to the speed with which 'normal service' is recovered. Most of the work in this discipline is unseen, and in many cases unacknowledged, even though its core activities are going on all the time to ensure a suitable service.


Introduction to Availability Management

Availability Management is a process under Service Design in ITIL Version 3.


Availability, capacity and service continuity management are inter-related. All three strive to eliminate risks to agreed-upon standards of performance for IT services. What distinguishes these three key aspects of service delivery is:


Availability Management


Objectives

The goal of all availability process owners is to maximize the uptime of the various online systems for which they are responsible - in essence, to make them completely fault-tolerant.

"Constraints inside and outside the IT environment make this challenge close to impossible. Budget limitations, component failures, faulty code, human error, flawed design, natural disasters, and unforeseen business shifts such as mergers, downturns, and political changes are just some of the factors working against that elusive goal of 100 percent availability--the ultimate expression of high availability."

By Harris Kern, TechRepublic, 04 January 2003

Objective - Availability

Objective - Recoverability
Decrease the Mean Time to Repair (MTTR) for Major Incidents through proactive incident readiness measures such as a documented inventory of service restoration procedures.


Process Coverage

Scope

Availability Management is concerned with the design, implementation, measurement and management of IT Infrastructure availability, in order to ensure that stated business requirements for Availability are consistently met. It should be applied to all new IT Services, to existing services where Service Level Requirements (SLRs) or Service Level Agreements (SLAs) have been established, and to those IT Services deemed survival- or business-critical, regardless of whether a formal SLA exists. Suppliers (internal and external) should be subject to the same Availability requirements in providing their services.

In Scope
Availability management:

Usually Excluded

Critical Success Factors

Relationship to Other Processes

Change Management (CM)
Effective change management reduces unplanned downtime caused by inadequate planning and testing of application changes and enables a more proactive approach toward problem prevention.

An input from Availability Management to Change Management is details of the planned maintenance regime, i.e. frequency, duration and impact, for components underpinning a new IT Service. An output from Change Management to Availability Management is a schedule of planned maintenance activities for IT components detailing the times and IT Services that will be impacted.

Service Level Management (SLM)
Service level management impacts both service continuity management and availability management. SLM takes primary responsibility for interfacing with customers and determining which IT services are most crucial to the survival of the company, and which alternate means of conducting business are employed if they fail for a prolonged period. An input from Availability Management to the Service Level Management process is an assessment of the Availability that can be delivered for a new IT Service to enable the SLA to be negotiated and agreed. An output from Service Level Management to the Availability Management process is details of the agreed SLA that enables the appropriate Availability measurement and reporting to be instigated. AM supports the service level process by minimizing, to the extent possible, the time required to restore service after a component failure, and through post outage analysis.

Service Continuity Management (SCM)
An output from IT Service Continuity Management is a business impact assessment detailing the vital business functions dependent on IT Infrastructure Availability. An Input from Availability Management to IT Service Continuity Management is the Availability and recovery design criteria to maintain 'business as usual' by preventing or minimizing the impact of failures by use of techniques such as Component Failure Impact Assessment (CFIA).

Capacity Management (CapM)
Availability and Capacity Management are inextricably interconnected. High availability designs are usually achieved by adding excess capacity which is called upon in the event that one system in a redundant configuration fails. The removal of a device in a load-balanced setup calls upon the remaining device to shoulder all, or a greater proportion, of the load. In some situations the effect of this can be felt as degraded response times and a higher risk of failure for the remaining component. An input from Availability Management to Capacity Management is a completed CFIA for a new IT Service denoting where Availability techniques are to be deployed so as to provide additional Infrastructure resilience. An output from Capacity Management to Availability Management is the Capacity Plan detailing how the Capacity requirements associated with the provision of additional Infrastructure resilience will be met.

Putting availability measurement tools in place on networks and application servers requires additional capacity to accommodate the administrative overhead associated with monitoring and fault reporting. This need must be conveyed to Capacity Management.

Financial Management (FM)
An input from Availability Management to IT Financial Management is the Cost of non-Availability arising from the loss of an IT Service(s) to help cost justify improvements defined within the Availability Plan. An output from IT Financial Management to Availability Management is the costs associated with proposed upgrades to the IT Infrastructure to deliver increased levels of Availability.

Problem Management (PM)
Problem Management interfaces with AM on a daily basis to ensure that all component problems have been identified and properly recorded. AM uses this information to assess the results of component outages and recovery capabilities.

Incident Management (IM)
AM affects IM by removing potential incidents from the infrastructure through proactive measures, and by reducing the total impact on the infrastructure and business operations through proactive recovery strategies and procedures.


Policies & Guidelines


How the Process Scales

In small and medium-sized organizations the purposes of Availability Management will traditionally be subsumed within other management functions. Overall availability considerations will be assumed by each vertical infrastructure area - network, desktop, middleware, application support, etc. This implies an absence of, or unfocused attention to, end-to-end service considerations, as each area attends exclusively to the configuration items within its purview.

Organizations with business and mission critical applications which are dependent on high degrees of availability will adopt Availability procedures and tools for those respective applications. One of the spurs for this will be the need for Continuity Plans in the event of a prolonged outage to that system. It is the need for business continuity which drives the adoption of a Disaster Recovery Plan and Continuity measures. The need for improved availability management will follow closely from these concerns.

As part of this re-focusing, the need for a wider, end-to-end service focus will become evident. The benefits of a corporate Availability Management direction which integrates the separate infrastructure areas and leverages overall Availability investment and planning initiatives will surface. There will increasingly be a recognition of the similarity in availability concerns, both horizontally across application areas and vertically amongst systems. Supporting this movement are the benefits to be derived from common standards and toolset usage within the organization.

The declining cost of high availability components has made them, and will continue to make them, more widely affordable to middle-size organizations. The cost of component redundancy is becoming commoditized and is within the reach of an increasing number of organizations. Consequently, the precepts and directions associated with high availability designs can be integrated at acceptable cost.

Many of the activities associated with Availability Management are associated with, or partly contingent upon, the adoption of service level management principles and practices. Since the need for, and extent of, high availability design is a function of the overall availability of systems as negotiated and recorded in service level agreements, there is a close relationship between the two management disciplines. Moreover, because availability is affected by overall capacity, Availability Management and Capacity Management should be considered in tandem. The two areas each have annual plans which must be coordinated along with the annual budget plan.

An extension of the concern of Availability to include aspects of recovery eventually brings into focus who does what in this area. For example, Incident, Problem and Availability Management all have a concern with regard to the restoration of service following a "Major" outage. Depending upon the characteristics of the organization, the role of Situation (or Major Incident) Manager will fall to a Senior Incident Coordinator, to Problem Management (as described in the Microsoft Operations Framework) or to Availability Management. All have an interest in the outcome:

Since Availability Management's concern encompasses the interests of the other two, it is the natural confluence for these concerns.


Key Concepts

Availability Designs

Availability comes at a cost: the higher the percentage of overall availability desired, the greater the cost. Five milestones along the cost curve represent strategies to achieve greater availability; the ordering reflects where in the consideration of availability each strategy would normally first be considered.

Base Product and Components
Configuration Items (CIs) placed into the infrastructure will have reliability features usually expressed as Mean Time Between Failures (MTBF). The use of products with high MTBF rates will, over a large statistical sample of devices, translate into higher availability.

In most cases many components will form a system, and multiple systems may form a service. To the User it is the Availability of the service which matters - they don't particularly care which component has failed.

Service availability is determined by techniques such as Component Failure Impact Analysis (CFIA). The technique uses the MTBF rates of the components to determine the overall rate. Certain key factors are involved in this determination:
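Whatever factors apply, the underlying arithmetic can be illustrated. Below is a minimal sketch, in Python, of a CFIA-style roll-up: each component's availability is estimated from its MTBF and MTTR, and the availabilities of serially dependent components multiply. All figures are illustrative assumptions, not vendor data.

```python
# CFIA-style availability roll-up: component availability is estimated
# from MTBF/MTTR, and serially dependent components multiply.

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of a single component."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical service chain: network, server, database
components = {
    "network":  availability(mtbf_hours=10_000, mttr_hours=2.0),
    "server":   availability(mtbf_hours=20_000, mttr_hours=4.0),
    "database": availability(mtbf_hours=15_000, mttr_hours=6.0),
}

service_availability = 1.0
for name, a in components.items():
    print(f"{name:9s} {a:.5%}")
    service_availability *= a   # serial chain: all components must be up

print(f"service   {service_availability:.5%}")  # lower than any single component
```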

Effective Service Management processes
Effective processes such as ITIL best practices can have a significant impact on improving overall availability. The benefits of ITIL Availability Management are cited in the ITIL Service Delivery manual:

System Management
System management involves the toolsets, policies and procedures used to keep mission-critical systems up and running. It includes monitoring, diagnostics and automated error recovery to enable fast detection and resolution of potential and actual IT failures. It makes performance adjustments based on dynamic circumstances by …

High Availability (HA) Design
A design for high Availability will consider the elimination of single points of failure and/or the provision of alternative components to provide minimal disruption to the business operation should an IT component failure occur. The design will seek to eliminate or minimize the effects of planned downtime to the business operation normally required to accommodate maintenance activity and the implementation of Changes to the IT Infrastructure or business application. Recovery criteria should define rapid recovery and IT Service reinstatement as a key objective within the designing for recovery phase of design. The design should utilize fault tolerant devices which make all reasonable attempts to recover from exceptions encountered within or from lower level components. They will not allow invalid data from their clients to introduce errors in their own state. When a system is built using these HA components, overall system robustness will significantly increase service availability.

Having identified potential SPOFs, the designer can decide on what kind of redundancy, if any, is required. There are two basic choices: N redundancy (2N, 3N, …), where every component is duplicated, or N+1 redundancy, where only selected components have a backup. A 2N design can often provide faster switch-over times than an N+1 design, but can also be much more expensive if, for instance, the system has a large number of I/O connections (as in a switch, for example). In that case, a single spare connection could serve as the standby for a large number of active connections. Redundancy may be extended to any resource, including network links. The redundant links could, for example, act as standbys or provide load balancing for greater throughput.
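As a rough illustration of the redundancy arithmetic (not a statement about any particular product), the sketch below compares a single component against a 2N pair and an N+1 pool, assuming independent failures and an illustrative 99.9% component availability.

```python
from math import comb

a = 0.999  # assumed availability of one component

# 2N: a duplicated pair fails only if both members fail
a_2n = 1 - (1 - a) ** 2

def at_least_k_of_n(a: float, n: int, k: int) -> float:
    """Probability that at least k of n independent components are up
    (models an N+1 pool that needs k actives out of n installed)."""
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))

print(f"single component: {a:.4%}")                       # 99.9000%
print(f"2N pair:          {a_2n:.4%}")                    # 99.9999%
print(f"N+1 (4 of 5 up):  {at_least_k_of_n(a, 5, 4):.4%}")
```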

Special Solutions with Full Redundancy
To approach continuous Availability in the range of 100% (greater than 99.999%, i.e. five 9s) requires expensive solutions that incorporate full redundancy. Redundancy is the technique of improving Availability by using duplicate components; for stringent Availability requirements to be met, these need to work autonomously in parallel. These solutions are not restricted to the IT components alone, but extend to the IT environment, i.e. power supplies, air conditioning, telecommunications. Full redundancy designs add architectural components to increase load balancing, fail-over and redundancy in system configurations. They include such measures as:

The Availability Index

A somewhat different rendering of an availability scale is presented in Blueprints for High Availability. The graph below displays Marcus and Stern's Availability Index.

The Availability Index - Marcus and Stern

Their book is devoted to describing these ten major groupings of availability strategies. The graph illustrates the relationship between increasing availability and the overall costs to achieve it. Total investments become increasingly great to achieve marginal increases in overall availability. The ten major areas in increasing overall availability are:

  1. Good System Administrative Practices - these practices lay the foundation for all the technologies that can be added to provide better availability. Some are easy to implement while others are hard. ---> 20 Key High Availability Practices
  2. Backups and Restores - backups represent the last line of defence against data loss. Regularly test your ability to restore, and ensure that backups are not themselves a single point of failure - make two copies of critical backups.
  3. Disk and Volume Management - disks are a frequent point of failure. Removing them as a single point of failure (through mirroring) and ensuring the ability to quickly replace a bad disk (hot swapping) are important and relatively inexpensive first steps toward higher availability. In addition, technological approaches such as SAN, NAS and virtualization are other important methods of ensuring disk availability.
  4. Networking - networking computers has increased the overall complexity and the number of potential failure nodes. Networks experience periods of peak load which are sometimes unpredictable, and problems are often difficult to track down. They are also highly sensitive to denial-of-service attacks.
  5. The Environment in which Critical Systems Operate - data centres require specific operating characteristics to minimize risk and maximize server and application availability and performance.
  6. Managing Clients - there are two types of clients: those residing on the internal network and mobile clients. The latter present enhanced difficulty in eliminating client disks as a single point of failure for business-critical data and in re-creating client systems in the event of failure or a request for a replacement.
  7. Application-Level Recovery Methods - applications should deal with failures in a way which minimizes the overall effect on the business. The design of the application, as well as the robustness of the application software, will determine how well the system tolerates minor failures without overly lengthy outages.
  8. Clustering Techniques - methods to ensure the availability of a second (or more) system in the event that a primary system fails; the standby server quickly takes over the entire load of the system (i.e., failover). A minimal failover sketch follows this list.
  9. Replication Techniques - the copying of data from one disk to another, completely independent system, resulting in two equally consistent and viable data sets.
  10. Disaster Recovery - the ability to recreate the environment within a specified time-frame following a failure which would otherwise carry a high risk of not being recoverable for days or longer.
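To make the clustering idea in item 8 concrete, here is a minimal, self-contained failover sketch: a standby promotes itself after the primary misses consecutive heartbeats. The probe function, interval and threshold are invented for illustration; a real cluster manager involves far more (fencing, quorum, state synchronization).

```python
import time

HEARTBEAT_INTERVAL = 1.0        # seconds between health checks (assumed)
MISSES_BEFORE_FAILOVER = 3      # consecutive misses that trigger promotion

def primary_alive() -> bool:
    """Placeholder health probe; a real cluster would ping the primary node."""
    return False  # simulate a failed primary so the sketch completes

def standby_loop() -> None:
    misses = 0
    while True:
        if primary_alive():
            misses = 0          # healthy heartbeat resets the counter
        else:
            misses += 1
            if misses >= MISSES_BEFORE_FAILOVER:
                print("primary unresponsive: promoting standby (failover)")
                return
        time.sleep(HEARTBEAT_INTERVAL)

standby_loop()
```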

Recoverability Strategies

Designing for Availability is a key activity driven by Availability Management. This ensures that the stated Availability requirements for an IT Service can be met. However, Availability Management should also ensure that within this design activity there is focus on the design elements required to ensure that when IT Services fail, the service can be reinstated to enable normal business operations to resume as quickly as is possible.

'Designing for Recovery' may at first sound negative. Clearly, good Availability design is about avoiding failures and delivering, where possible, a fail-safe IT Infrastructure. With that focus alone, however, too much reliance is placed on technology, and too little emphasis on the 'safe-fail' aspects of the IT Infrastructure. The reality is that failures will occur. The way the IT organization manages failure situations can have the following positive outcomes:

A key aim is to avoid small Incidents becoming major by ensuring the right people are involved early enough to avoid mistakes being made and to ensure the appropriate business and technical recovery procedures are invoked at the earliest opportunity.

This is the responsibility of the Incident Management process and the role of the Service Desk; but, to ensure business needs are met during major IT Service failures and to ensure optimal recovery, the Incident Management process and Service Desk need to have defined and execute:

Understanding the Incident 'lifecycle'
It is important to recognize that every Incident passes through a number of stages. These are described as follows:

This 'lifecycle' view provides an important framework for determining, amongst other things, systems management requirements for Incident detection, diagnostic data capture requirements and tools for diagnosis, recovery plans to aid speedy recovery, and how to verify that IT Service has been restored.

Systems Management
The provision of Systems Management tools positively influences the levels of Availability that can be delivered. Their implementation and exploitation should have a strong focus on achieving high Availability and enhanced recovery objectives. In the context of recovery, such tools should be exploited to provide automated failure detection, assist failure diagnosis and support automated error recovery.

Diagnostic data capture procedures
When IT components fail it is important that the required level of diagnostics is captured to enable Problem determination to identify the root cause. For certain failures the capture of diagnostics may extend service downtime; however, failing to capture the appropriate diagnostics exposes the service to repeat failures.

Where the time required to take diagnostics is considered excessive, a review should be instigated to identify whether techniques and/or procedures can be streamlined to reduce the time required. Equally, the scope of the diagnostic data available for capture can be assessed to ensure that only the diagnostic data considered essential is taken. The additional downtime required to capture diagnostics should be included in the recovery metrics documented for each IT component.

Determine backup and recovery requirements
The backup and recovery requirements for the components underpinning a new IT Service should be identified as early as possible within the development or selection cycle. These requirements should cover hardware, software and data. The outcome from this activity should be a documented set of recovery requirements that enable the development of appropriate recovery plans.

Develop and test a backup and recovery strategy and schedule
To anticipate and prepare for performing recovery such that reinstatement of service is effective and efficient requires the development and testing of appropriate recovery plans based on the documented recovery requirements. The outcome from this activity should be clear, operable and accurate recovery plans that are available to the appropriate parties immediately the new IT Service is introduced. Wherever possible, the operational activities within the recovery plan should be automated. The testing of the recovery plans also delivers approximate timings for recovery. These recovery metrics can be used to support the communication of estimated recovery of service and validate or enhance the CFIA documentation.

Recovery Metrics
The provision of a timely and accurate estimation of when service will be restored is the key informational need of the business. This information enables the business to make sensible decisions on how they are to manage the impact of failure on the business and on their Customers. To enable this information to be communicated to the business requires the creation and maintenance of recovery metrics for each IT component covering a variety of recovery scenarios.

Backup and recovery performance
Availability Management must continuously seek and promote faster methods of recovery for all potential Incidents. This can be achieved via a range of methods, including automated failure detection, automated recovery, more stringent escalation procedures, and the exploitation of new and faster recovery tools and techniques.

Service restoration and verification
An Incident can only be considered 'closed' once service has been restored and normal business operation has resumed to the User's satisfaction. It is important that the restored IT Service is verified as working correctly as soon as service restoration is completed, and before any technical staff involved in the Incident move on to other incidents or activities. In the majority of cases this is simply a case of getting confirmation from the User. However, the User for some services may be a Customer of the business.

For these types of services it is recommended that IT Service verification procedures are developed to enable the IT support organization to verify that a restored IT Service is now working as expected. These could simply be visual checks of transaction throughput or User simulation scripts that validate the end-to-end service.
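A User simulation script of the kind mentioned above can be very small. The following sketch drives one synthetic end-to-end request and confirms the service answers correctly within a threshold; the URL and threshold are hypothetical placeholders, not part of any real service.

```python
import time
import urllib.request

SERVICE_URL = "http://example.internal/healthcheck"  # placeholder endpoint
MAX_SECONDS = 2.0                                    # assumed acceptance threshold

def verify_service(url: str = SERVICE_URL) -> bool:
    """One synthetic transaction: must return HTTP 200 within the threshold."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=MAX_SECONDS) as resp:
            ok = resp.status == 200
    except OSError:           # connection refused, DNS failure, timeout, etc.
        ok = False
    elapsed = time.monotonic() - start
    return ok and elapsed <= MAX_SECONDS

print("service verified" if verify_service() else "verification FAILED")
```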


Availability - Recoverability Trade-off Considerations

Designing for high availability (HA) in mission critical systems requires a split personality. The systems designer must first ensure that faults - in both hardware and software - happen as rarely as possible. But the designer must also assume that faults will occur and take precautions to ensure that the system recovers quickly.

The question is: which precautions are necessary? To answer that, the designer must first determine which services in a system actually require HA and what degree of HA each of those services needs. Once that has been decided, the designer can begin identifying any potential SPOF; that is, any component - be it a CPU, networking card, power supply, or software module - whose failure can cause the service to fail.

For non-survival-critical applications the trade-offs are most often financial. Availability (i.e., uptime) is expressed as a percentage of the time the system is available over the total time of committed operations (e.g., 24x7 minus scheduled maintenance, or weekdays only, Monday through Friday). While the organization may target continual availability, the costs implied in implementing fault tolerance and total redundancy to remove all single points of failure can be exorbitant, so a correct trade-off between availability and recovery strategies must be struck.

Beyond a certain level of uptime the costs begin to increase at exponential rates. At a certain point it may be more cost-effective to invest in recovery strategies. That point will be determined by the extent of losses incurred by the lines of business when outages happen.

The organization's ability to make an accurate assessment of these trade-offs is contingent upon its ability to assess the business costs of Unavailability, i.e., the losses incurred when systems and services are in various degrees of unavailability and/or degradation. Because of the costs involved, implementing high availability should be a comprehensive effort. Care and attention should be given to determining Business Availability Requirements to accurately assess how much unavailability can be tolerated and under what circumstances (e.g., different times of the day or year, geographical differences, etc).

The ability to make this determination accurately is vastly furthered by a mature Service Level Management process. The SLM process would encompass the assessment in a negotiation wherein added availability is discussed in the context of the additional cost of delivering it. The trade-off would be subject to a negotiation exercise which recognizes this, encompassed within an overall Availability Plan wherein individual availability needs can be packaged into an enterprise assessment, thereby achieving some economies in purchasing and support.


Availability Plan

To provide structure to, and aggregation of, the wide range of initiatives that may be needed to improve Availability, these should be formulated within a single Availability Plan.

The Availability Plan should have aims, objectives and deliverables and should consider the wider issues of people, process, tools and techniques as well as having a technology focus. In the initial stages it may be aligned with an implementation plan for Availability Management, but the two are different and should not be confused.

Over time, as the Availability Management process matures, it should evolve to cover the following:

During the production of the Availability Plan, the following functional areas should be consulted:

The Availability Plan should cover a period of one to two years with a more detailed view and information for the first six months. The plan should be reviewed regularly with minor revisions every quarter and major revisions every half year. Where the IT Infrastructure is only subject to a low level of Change this may be extended as appropriate.

It is recommended that the Availability Plan be considered complementary to the Capacity Plan, with its publication aligned to the Capacity and business budgeting cycle. If a demand is foreseen for high levels of Availability that cannot be met due to the constraints of the existing IT Infrastructure or budget, then exception reports may be required for the attention of both senior IT and business management.


Availability Measurement and Reporting

"not everything that can be counted counts and not everything that counts can be counted"

Service Level Management for Enterprise Networks, Lundy Lewis, Artech House, 1999, ISBN: 1-58053-016-8, p.13

Availability is measured at the User's screen, not at some intermediate component. The challenge lies in integrating all intermediate components into the end-to-end metric. The result, in all cases, is that a much higher availability is required of each subcomponent, since individual component availabilities multiply into the overall measure.

Meeting this overall requirement means stating, and then operationally meeting, availability targets for each individual component in the service chain. These metrics are stated as Operational Level Objectives (OLOs) amongst internal service partners, and in Underpinning Contracts (UCs) when the service is provided by an external provider (e.g., an ISP or ASP).
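Because serial availabilities multiply, component-level OLO targets can be derived from the end-to-end target by taking roots. A minimal sketch, assuming five equally weighted components and an illustrative 99.9% end-to-end SLA:

```python
# Deriving per-component OLO targets from an end-to-end SLA.
end_to_end_target = 0.999   # 99.9% agreed in the SLA (illustrative)
n_components = 5            # e.g. client, network, app server, DBMS, storage

# Each component must exceed the nth root of the end-to-end target.
per_component = end_to_end_target ** (1 / n_components)
print(f"each component needs {per_component:.5%}")  # roughly 99.98% each

# Check: multiplying the per-component targets recovers the SLA figure.
print(f"end-to-end check:    {per_component ** n_components:.5%}")
```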

There are five primary methods by which availability data gets captured:

Applications with response times recorded above a pre-defined threshold are considered unavailable. A combination of the above methods usually provides the best overall result, though considerations of cost, reliability and intrusiveness may preclude some approaches.

These collection techniques fall into one of two primary categories:

End-to-End Versus Component Availability
"The process of acquiring meaningful performance data is broken. We can report on "five nines" for all of our network components; we can dump database extracts into Excel and figure out what percentage of our calls related to a specific incident; we can dig into our ERP system and calculate transaction times. What we can't seem to do, however, is to pull these numbers together into business indicators that show whether or not we're successful: not successful in keeping all of our servers online, or successful in closing all of our open calls, but successful in terms of our company's vision and business objectives."

Char LaBounty, The Art of Service Management, p.5

The first step in measuring and managing service levels is to define each service and map out the service from end-to-end.

Operational Metrics

  1. Operating system service on hardware, presuming hardware availability. Most platform vendors that claim 99.9 percent uptime are referring to this.
  2. End-to-end database service, presuming operating system and hardware availability.
  3. Application service availability, including DBMS, operating system, and hardware availability.
  4. Session availability, including all lower-level layers.
  5. Application server divorced from the database. In this scenario, the business logic and connectivity to a data store are measured (and managed) independently of the database component. Note that a combination of (2) and (5) is essentially the same as service (4) to the user/client.
  6. A complete, end-to-end measure, including the client and the network. While the notion of a service implies the network, it is included here to show that you can establish the measure of availability for the stack as a whole with or without the network. For internet-based applications, separating the network is important, because service providers can rarely, if ever, definitively establish and sustain service levels across the public network. Moreover, when a user connects across the internet, it is important to understand how much of the user experience is colored by the vagaries of the internet, and how much is under the direct control of operational staff. Decomposition into services is the first step toward defining what availability is measured, and why. As will be seen, indicating end-user availability over time does not require every service component to be measured and tracked separately.

There are three primary methods for capturing type 6 - the end-to-end measurement:

Generating sample transactions that simulate the activities of the user community and that can be monitored - using scripts, intelligent agents or tools to capture transactions and later play them back against an application service, this approach allows a simulated response time to be measured. By using distributed server resources or placing dedicated workstations at desired locations to submit transactions for critical applications, a continuous sampling of response times by location can be captured and reported. The strength of this method is its ability to reflect the end-user experience using samples, rather than having to collect large volumes of data across all transactions from all end users.
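A minimal sketch of such a sampling probe follows. It replays one scripted request at intervals and records response times; samples above the pre-defined threshold count as unavailable, per the rule noted earlier. The target URL, interval and threshold are assumptions for illustration.

```python
import time
import urllib.request

TARGET = "http://example.internal/app/login"  # hypothetical scripted transaction
THRESHOLD = 3.0                               # seconds; above this = unavailable
samples = []

for _ in range(10):                           # a real probe would run continuously
    start = time.monotonic()
    try:
        urllib.request.urlopen(TARGET, timeout=THRESHOLD).close()
        elapsed = time.monotonic() - start
    except OSError:                           # failed or timed out
        elapsed = float("inf")
    samples.append(elapsed)
    time.sleep(1)                             # sampling interval (assumed)

available = sum(t <= THRESHOLD for t in samples)
print(f"sampled availability: {available / len(samples):.0%}")
```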

Unavailability
Traditionally, availability is reported as the percentage of total time a system or component was available, divided by the time it was supposed to be available (removing time set aside for maintenance operations and/or time the system is not required to be operational). This statistic favors the IT Provider because it tends to downplay the effects of service outage. If the goal is 99.6% availability and the result is 99.0%, then the Provider is only 0.6% off target, which translates to (99.6-99.0)/99.6 = 0.6% variation from expected - on the surface a good result.

Stating the goal as percent Unavailability has a different impact. In this case the goal was 0.4% unavailability and the Provider delivered 1.0% unavailability. Here the variation is (1.0-0.4)/0.4 = 150% from expected; an entirely different perspective ensues. Even quoting unavailability does not provide the whole picture. From the Customer's perspective, one outage of 2 hours may be more palatable than 6 outages of 20 minutes each (or it may not, depending on the application). The difference would certainly lead to a perception of greater instability in the second instance.
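The two framings can be computed side by side; this small sketch reproduces the figures above.

```python
# Availability vs. unavailability framing of the same result.
target_avail, actual_avail = 99.6, 99.0

avail_variation = (target_avail - actual_avail) / target_avail * 100
print(f"availability framing:   {avail_variation:.1f}% off target")   # ~0.6%

target_unavail = 100 - target_avail   # 0.4% allowed unavailability
actual_unavail = 100 - actual_avail   # 1.0% actual unavailability
unavail_variation = (actual_unavail - target_unavail) / target_unavail * 100
print(f"unavailability framing: {unavail_variation:.0f}% off target") # 150%
```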

The purest and most useful measure relates unavailability to costs to the business. Capturing this requires that the Service Desk accurately record the number of people affected, and that this figure be multiplied by an amount per unit time of outage. This latter number must be established ahead of time; frequently it will have been devised as part of a Business Impact Analysis (BIA) when developing Disaster Recovery Plans (DRPs) for major applications.

Availability Measures
There are a number of related metrics which combine to present a complete picture of availability. The following table illustrates how they relate to each other as an expression of a component failure.

Recovery Metrics

This table lists the various stages that occur between reported incidents, which translate to "outages" for Availability Management. It depicts the "Incident Lifecycle", consisting of the following identifiable stages:

A. Detection Elapsed Time - the time between the occurrence of an Incident and its detection by a monitoring agent, User or support staff.
B. Response Time - the time to diagnose the Incident, i.e. between the detection of an Incident and the beginning of the restoration effort.
C. Repair Time - the time to recover service, i.e. between the commencement of the recovery exercise and the time the service is considered 'fit for purpose' by the recovery agent.
D. Recovery Time - the time between when the service is "fit for purpose" and when the infrastructure is considered restored; this usually includes testing and User verification of "fit for purpose".
E. Uptime (Mean Time To Failure, MTTF) - the time between the restoration of service associated with a component and the next occurrence of a failure, i.e. no incident occurring.
F. Detection of Next Incident.
G. Time to Repair (Mean Time To Repair, MTTR) - A + B + C + D.
H. Time Between System Incidents (Mean Time Between Failures, MTBF) - E + G.
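The summary metrics follow mechanically from the stage durations. A minimal sketch with invented stage times (in hours):

```python
# MTTR = A + B + C + D per incident; MTBF = uptime (E) + time to repair (G).
incidents = [
    # (detection A, response B, repair C, recovery D, following uptime E)
    (0.1, 0.5, 1.0, 0.4, 700.0),
    (0.2, 0.3, 2.0, 0.5, 450.0),
    (0.1, 1.0, 0.5, 0.4, 900.0),
]

ttrs, uptimes = [], []
for a, b, c, d, uptime in incidents:
    ttrs.append(a + b + c + d)   # G for this incident
    uptimes.append(uptime)       # E for this incident

mttr = sum(ttrs) / len(ttrs)
mtbf = sum(u + t for u, t in zip(uptimes, ttrs)) / len(incidents)

print(f"MTTR: {mttr:.2f} h")     # mean of A+B+C+D
print(f"MTBF: {mtbf:.2f} h")     # mean of E+G
```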

Roles and Responsibilities

Availability Manager

Senior Leadership

Availability Management Team

Operations Management

Line(s) of Business


Performance Measurement

Key Goal and Performance Indicators
The User view of Availability is influenced by three factors:

Measurements and reporting of User Availability should therefore embrace these factors. The methodology employed to reflect User Availability could consider two approaches:

The method employed should be influenced by the nature of the business operation. A business operation supporting data entry activity is well suited to reporting that reflects User productivity loss. Business operations that are more Customer-facing, e.g. ATM services, benefit from reporting transaction impact.

Metrics
The table below translates uptime requirements into annual downtime.

Percent Uptime      Annual Downtime
100                 0
99.9999             less than 1 minute
99.999              5.25 minutes
99.99               52.8 minutes
99.98               1 hour, 45 minutes
99.9                8 hours, 45 minutes
99.8                17 hours, 30 minutes
99.5                43 hours, 48 minutes
99.0                87 hours, 36 minutes
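Each table value follows from a standard year of 8,760 hours: annual downtime = (1 - uptime%) x 8,760. A quick sketch that reproduces the table:

```python
# Convert percent uptime to annual downtime.
HOURS_PER_YEAR = 24 * 365  # 8760

for uptime in (100, 99.9999, 99.999, 99.99, 99.98, 99.9, 99.8, 99.5, 99.0):
    down_hours = (1 - uptime / 100) * HOURS_PER_YEAR
    h, m = int(down_hours), round((down_hours % 1) * 60)
    print(f"{uptime:>8}%  ->  {h} h {m} min")   # e.g. 99.5% -> 43 h 48 min
```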

Measurement Issues


Processes

Availability Process Summary

Controls
  • SLAs, OLAs, UCs
  • available funding and investment policies
  • infrastructure standards policies
  • Availability and recovery design criteria
Inputs
Activities
  • Maintain availability guidelines
  • Determine costs of service/system unavailability
  • Determine availability requirements
  • Assess feasibility of Availability proposals
  • Establish Availability strategies
  • Formulate designs for availability and recovery
  • Produce and maintain an Availability Plan
  • Define CI targets for availability, reliability and maintainability/serviceability
  • Establish methods and report results
  • Maintain AMDB
  • Conduct supplier reviews
  • Maintain awareness of high availability HW and SW tools and their associated costs
  • Develop and maintain recovery strategies
Outputs
  • IT infrastructure resilience and risks assessments
  • Targets for availability, reliability and maintainability
  • Availability monitoring
  • Reports of availability, reliability and maintainability/serviceability achieved
  • Availability Plan updated
  • Availability Improvement Plans
  • System recovery documentation
  • Updated AMDB
Mechanisms
  • Customer Relationship Management
  • Project Management
  • IT-Business alignment
  • Measurement
  • Continuous Improvement
  • Risk Management
  • Change Management
  • Situation Management
  • Problem management
  • Security Management
  • Supplier Management
  • Continuity Management


Inputs


Controls


Outputs


Mechanisms


Activities

Availability Design

Maintaining Availability Guidelines
Availability policies and guidelines (i.e., this document) are maintained under the control of Change Management. Availability Management will, from time to time, make modifications to the policies, descriptions, etc. maintained in this document, and is responsible for ensuring it is reviewed in accordance with IS documentation protocols (an aspect of Change Management).

Determining Costs of Unavailability
The overall costs of an IT Service are influenced by the levels of Availability required and by the investments in technology and services, provided by the IT support organization, needed to meet that requirement. Availability costs money, and those costs increase exponentially as requirements approach continuous operation.

However, it is important to remember that UnAvailability also has a cost - UnAvailability isn't free either. For highly critical business systems it is necessary to consider not only the cost of providing the service but also the costs incurred from failure. The optimum balance to strike is the cost of the Availability solution weighed against the cost of UnAvailability.

The cost of an IT failure could simply be expressed as the number of business or IT transactions impacted, based on either an actual figure (derived from instrumentation) or an estimate. When measured against the vital business functions that support the business operation this can provide an indication of the consequence of failure.

The advantage of this approach is the relative ease of obtaining the impact data and the lack of any complex calculations. It also becomes a 'value' that is understood by both the business and IT organization. This can be the stimulus for identifying improvement opportunities and can become a key Metric in monitoring the Availability of the IT Service. The major disadvantage of this approach is that it offers no obvious monetary value that would be needed to justify any significant financial investment decisions for improving Availability.

Where significant financial investment decisions are required, it is better to express the cost of failure arising from system, application or function loss as a monetary 'value' to the business. The monetary value can be calculated as a combination of the tangible costs associated with failure, but can also include a number of intangible costs. The monetary value should also reflect the cost impact on the whole organization, i.e. the business and the IT organization.

Assigning costs to Unavailability allows the organization to establish what is acceptable risk. Once established, reliable analyses of the optimal level of availability are possible, and a correct assessment of the availability-recoverability trade-off can be made.
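The trade-off can then be framed as minimizing total cost: the (steeply rising) cost of the availability solution plus the expected cost of unavailability. The sketch below uses entirely invented cost figures purely to show the shape of the calculation.

```python
# Total cost = cost of availability solution + expected outage losses.
HOURS_PER_YEAR = 8760
LOSS_PER_OUTAGE_HOUR = 5_000        # assumed business cost of unavailability ($/h)

def solution_cost(availability: float) -> float:
    """Costs rise steeply as availability approaches 100% (illustrative model)."""
    return 2_000 / (1 - availability)

def total_cost(availability: float) -> float:
    expected_outage_hours = (1 - availability) * HOURS_PER_YEAR
    return solution_cost(availability) + expected_outage_hours * LOSS_PER_OUTAGE_HOUR

# Scan 99.0% .. 99.9% for the cheapest overall point.
best = min((a / 1000 for a in range(990, 1000)), key=total_cost)
print(f"lowest total cost near {best:.1%} availability")
```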

Determining Availability Requirements
In response to a request, Availability Management evaluates Availability plans, service catalog descriptions, OLAs and SLAs, etc to ensure consistency with updated business plans, new service requirements, strategic initiatives, etc. On the basis of this the Availability documentation is updated to accommodate revised availability, contingency and security needs. Designs are then created detailing what needs to be developed or purchased to fulfill the requirements. Plans define the timeframes within which each design component is to be implemented, how performance will be tracked and evaluated, and how to submit availability improvement recommendations through Change Management. Service Level Management, as the overseer of the Service Catalogue, is consulted as to how the service needs to be modified to accommodate the revised Availability specifications. Likewise, Security is consulted whenever security matters are affected by the modifications. Current configurations are referenced in the CMDB and any plans for their modification detailed within the database. Capacity Design documents are referenced to ensure conformance with revised availability needs and IT Finance is informed of any costs and charging implications.

Assessing Feasibility of Availability Proposals
Using availability goals and objectives and information accompanying new requests, Availability Management will assess the feasibility of new availability initiatives - whether they are practical and cost effective for the organization to implement. Using a template such as that presented in the Appendix, Availability Management presents its recommendation and reasons for the rejection of possible alternative approaches.

Establish Availability Strategies
There is a growing plethora of high availability choices emerging at acceptable cost. These include such approaches as clustering, load balancing, function offloading (most prominently of report generation) and fail-over. Maintaining the requisite expertise on an ongoing basis at acceptable cost requires the selection of key strategic approaches and strategies. Availability Management should suggest such strategies and remain aware of the risks associated with their adoption.

Formulating Designs for Availability and Recovery Architectures
When new services are being introduced or there is an entire revamping of an application, then Availability Management will participate, along with Architectural Planning, in ensuring that Availability and Recovery concerns are considered during system implementation. The principle involved here is that planning for availability at the outset ultimately leads to fewer problems of compatibility and that consolidated purchasing will produce cost savings.

Maintain Availability Plan
The Availability Plan represents the culmination of much of the planning activities associated with Availability Management. It should form an integral part of the annual operational review within the organization, and, as such, will relate availability needs to available revenue and expenditure forecasts. It contains summaries and rationales for the availability strategy, feasibility assessments and design activities. The plan will contain:

Define CI Targets for Availability, Reliability and Maintainability
Availability Management should maintain records of the Availability, reliability and maintainability/serviceability of Configuration Items (CIs). The best place to do this would be in a CMDB (though, lacking this level of sophistication in the CMDB, some method of linking the CI to another data source, e.g. the AMDB, may be considered). Comparing the targets for component availability with actual failure rates would provide Availability Management with valuable information for Availability improvement initiatives. It would also allow them to compare the MTTF rates of current devices with the claims of other product manufacturers, thereby permitting the gradual infusion of more reliable devices into the infrastructure.

Collect Availability Data and Report Availability Results
Hand in hand with establishing CI component targets goes the collection of actual failure data. The ability to do this is contingent upon the ability of the Incident Management system to record, or link to, a CMDB or Asset Management system which lists the CIs of all failed devices. Component performance statistics should be maintained in the AMDB for reference as required, and correlational and trend analysis of component data may result in investment recommendations to improve overall availability.

Maintain Availability Database
There can be vast amounts of availability data extracted from monitoring devices. This data should be maintained in a source easily accessible to Availability Management. Conversely, Availability Management may design key summary statistics and maintain these in lieu of the raw data. The advantage of the latter approach is lower maintenance; the drawback is loss of flexibility in statistical analysis. These decisions are identical to the considerations involved in maintaining management information in data warehouses.

Conduct Supplier Reviews
The availability obligations of suppliers should be described in contracts, with penalties for non-performance. Periodically, Availability Management should conduct reviews of this performance and recommend remedial actions when suppliers fail to meet expectations.

Maintain Awareness of High Availability Products
Availability Management should maintain an awareness of developments in high availability products in order to better present availability options and their associated costs in feasibility studies, improvement plans, and availability and recovery designs.

Develop and Maintain Recovery Strategy
Continuous availability is crucial to businesses which utilize technology for direct sales to customers. In these situations, the customer will simply go to another agent should there be outages of more than a few seconds. In the insurance industry, the primary customer today remains the insurance agent. The agent has a limited number of companies upon which to compare and usually has a period measurable in hours to develop comparisons or options. In this environment, outages up to 30 minutes can usually be tolerated without serious repercussions. Beyond this period, however, sales losses will increase dramatically as agents and buyers seek other sources. Further, as long as these "mission critical" areas are maintained, supporting services can usually tolerate outages measured in hours before the impact will reverberate to the front-line service areas.

Given these assumptions, there is a need to develop an availability strategy which places some emphasis on recovery within minutes or hours. While it is obviously preferable to reduce the occurrence of such outages, the exponential cost of moving from 99.9% to 99.99% (i.e. towards full redundancy) suggests that the additional money might be better spent on recovery strategies.


Recovery Strategies

Develop and Maintain System Recovery Procedures
Systems Recovery Procedures for Mission Critical applications should be maintained by Service Continuity Management. These procedures should be readily available to Availability Management in the event that quick service restoration is needed. In addition, Availability Management may maintain sets of generic restoration procedures for such things as file servers, back-up servers, routers, etc., which can be referenced to ensure speedy restoration of these devices.

Availability Management should undertake to have recovery procedures created for all Mission Critical applications, and for services whose recovery is sufficiently complex that documenting and implementing standard recovery processes will provide greater assurance of success and less chance of creating further incidents. These procedures should be controlled through Change Management methods, including testing them in a pre-production environment.

Perform System Outage Analysis (SOA)
Availability Management should prepare an SOA following the failure of any Mission Critical service, as well as for other devices or services experiencing recurrent problems. Availability Management should work closely with Problem Management in doing so, since they share a concern and responsibility for removing the causes of these problems. A template for an SOA is presented in the Appendix.

Analyze Incident Lifecycle for Improvement
As noted under 'Understanding the Incident lifecycle' above, every Incident passes through a number of identifiable stages, and this lifecycle view provides a framework for determining systems management requirements for Incident detection, diagnostic data capture, tools for diagnosis, recovery plans and service verification.

From the point of view of Availability Management, there may be bottlenecks in this process which, if addressed, could result in quicker restoration of services (and hence higher availability). If decomposed for the purpose of operational review, these discrete activities can be analyzed for workload and latency problems for which solutions might be advanced.

Proper Incident recording is a strategic enabler of proactive Incident Management. Achieving this requires more than the measurement of total elapsed resolution times - it requires analyzing the reasons why target times were not met. This information is only available by reviewing the Incident lifecycle and logging elements of that lifecycle in a manner which permits identification of repeatable elements and triggers areas for process improvement. It requires more information than the time between opening and closing the Ticket (i.e. MTTR). The total resolution effort typically involves multiple participants as Incidents are escalated from the Service Desk to Incident support to vendor support. Incident ownership is transferred, and each stage of the process and/or ownership change is a potential point of delay. The goal is to identify bottlenecks in the process where improvements can be implemented.

To do this properly requires the enterprise to indicate on the Incident Ticket when the process moves through phases of the Incident lifecycle. In some instances, these may be 'Status' changes on the Incident Ticket. Unfortunately, there is seldom a notation in the Incident Log or Audit trail of the time of the status change. As well, some service milestones within the Incident Management process do not link up directly with Status changes. Availability Management should consider adding status flags and encouraging entry-log creation in order to provide valuable information to analyze and improve the incident management process - the ultimate goal being speedier recovery and higher availability.
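Once status-change timestamps are being logged, the bottleneck analysis itself is straightforward. A minimal sketch, with invented status names and times standing in for a real Incident tool's export:

```python
from collections import defaultdict
from datetime import datetime

# (incident id, status entered, timestamp) as might be exported from a tool
log = [
    ("INC1", "detected",   "2024-01-10 09:00"),
    ("INC1", "diagnosing", "2024-01-10 09:05"),
    ("INC1", "repairing",  "2024-01-10 10:20"),
    ("INC1", "restored",   "2024-01-10 10:50"),
    ("INC2", "detected",   "2024-01-11 14:00"),
    ("INC2", "diagnosing", "2024-01-11 14:40"),
    ("INC2", "repairing",  "2024-01-11 15:00"),
    ("INC2", "restored",   "2024-01-11 15:10"),
]

# Group status changes by incident, preserving order.
by_incident = defaultdict(list)
for inc, status, ts in log:
    by_incident[inc].append((status, datetime.strptime(ts, "%Y-%m-%d %H:%M")))

# Time spent in a stage = gap until the next status change.
stage_totals = defaultdict(float)
for steps in by_incident.values():
    for (status, t0), (_, t1) in zip(steps, steps[1:]):
        stage_totals[status] += (t1 - t0).total_seconds() / 60

for status, minutes in sorted(stage_totals.items(), key=lambda kv: -kv[1]):
    print(f"{status:11s} {minutes:6.0f} min total")  # longest stage = bottleneck
```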

Review Continuity Strategy and Plans
Availability Management participates in the regular review of Disaster Recovery and Continuity Planning. This ensures that the organization retains its level of preparedness in the event of a catastrophe. Maintaining Disaster Recovery readiness is a time-consuming exercise without immediately demonstrable benefits. Organizations with a heavy dependence upon technology need to understand the costs of prolonged, unplanned outages and weigh those costs against an acceptable risk management profile.


Appendix


Terminology


Agreed Service Time

The time a service is expected to be available in a given period. For example, in a 24 x 7 x 365 operation the agreed service time would be 8,760 hours (the number of hours in a standard year).

Application Response Measurement (ARM)

A standard method of instrumenting applications for management. ARM comprises APIs designed to be built into networked applications, enabling them to be monitored for a range of performance characteristics, including response time.

Availability

Ability of a component or service to perform its required function at a stated instant or over a stated period of time. It is usually expressed as the availability ratio, i.e. the proportion of time that the service is actually available for use by Clients within the agreed service hours.

Availability Management Database (AMDB)

A data repository used to record and store selected data and information required to support key activities such as report generation, statistical analysis and Availability forecasting.

Availability Plan

A long-term plan for the improvement of IT availability within an agreed cost.

Auditability

Ability of Customer to audit the service or system as required

Backup

A technique whereby a system component is duplicated and the duplicate is put into reserve or standby, to be used when the primary fails.

Baseline

The present state of performance, from which changes to services can be reviewed.

Clustering

A technique whereby a system's workload is shared by multiple components or resources all operating at the same time. The separate and independent components act together as though they were a single resource.

Component Failure Impact Analysis (CFIA)

A technique developed by IBM which uses a matrix to identify areas of risk in IT service provision by linking the failure of individual CIs to their impact on the overall level of service provided.

Client

People and/or groups who are the targets of service. To be distinguished from User – the consumer of the target (the degree to which User and Client are the same represents a measure of correct targeting) and the Customer – who pays for the service.

Component  Availability

The availability of a particular component of a service offering – defined within standard hours of operation

Configuration item (CI)

Component of an infrastructure - or an item, such as a Request for Change, associated with an infrastructure - that is (or is to be) under the control of Configuration Management. CIs may vary widely in complexity, size and type, from an entire system (including all hardware, software and documentation) to a single module or a minor hardware component.

Configuration Management Database (CMDB)

A database that contains all relevant details of each CI and details of the important relationships between CIs.

CRAMM

CCTA Risk Analysis and Management Method – tools and techniques for the identification of risks and the provision of justified countermeasures to reduce or eliminate the threats posed by such risks.

CSF

Critical Success Factors – the most important issues or actions for management to achieve control over and within its IT processes.

Customer

The payer of a service; usually Customer management has responsibility for the cost of the service, either directly through charging or indirectly in terms of demonstrable business need.

Data mining

The automated extraction of hidden predictive information from databases

Downtime

The amount of time a service is not available to Users. There are typically two types: Planned Downtime is time set aside for maintenance activities that can be scheduled in advance; Unplanned Downtime results from unexpected events such as failures and prolonged outages.

End-to-end Measurement

A view of IT service that includes each of the end users of a service and their locations, together with the path they take to access the business application providing the core part of the service

End-User’s perspective

The performance of a service as it is experienced by the user at the desktop. This perspective is the ultimate measure of service quality.

Environment

A collection of hardware, software, network and procedures that work together to provide a discrete type of computing service. There may be one or more environments on a physical platform, e.g. test, production. Each environment has unique features and characteristics that dictate how it is administered.

Incident

Any event that is not part of the standard operation of a service and that causes, or may cause, an interruption to, or a reduction in, the quality of that service.

Fail-over

The process of switching from a failed component to a working component.

Failure in Time (FIT)

A statistic that measures how many failures a component will have per one billion operating hours. The lower the FIT rate, the more reliable the component. The FIT rate is used to derive a calculated MTBF.
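Because FIT counts failures per billion (10^9) operating hours, the conversion to a calculated MTBF is:

\[ \text{MTBF (hours)} = \frac{10^9}{\text{FIT}} \]

For example, a component with a FIT rate of 500 has a calculated MTBF of 10^9 / 500 = 2,000,000 hours.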

Fault

An error in hardware, software or procedure, either latent or discovered.

Fault Tolerant

A technique whereby systems or system components are designed to continue operating even during fault conditions. Some fault tolerant systems are designed to withstand internal errors, whereas others address failures in external or connected components.

Hot Swap

The ability to replace hardware or software modules without shutting down other parts of the system, thus reducing Mean Time To Repair (MTTR)

Isolation

A system is separated or divided into separate systems so that a fault condition in one does not affect the other. This technique is more effective in limiting the scope of outages than in preventing them.

Known Error

An incident or problem for which the root cause is known and for which a temporary work-around or a permanent alternative has been identified.

KGI

Key Goal Indicator – measures which tell management – after the fact – whether an IT process has achieved its business requirements, usually expressed in terms of information criteria:

- Availability of information needed to support business needs
- Absence of confidentiality and integrity risks
- Cost-efficiency of processes and operations
- Confirmation of reliability, effectiveness and compliance

KPI

Key Performance Indicator – measures that determine how well the IT process is performing in enabling KGIs to be reached; lead indicators of whether a goal is likely to be reached; and good indicators of capability, practice and skill.

Latency

The period of time that one component in a system waits for another component

Life-cycle

A series of states connected by allowable transitions. The life cycle represents an approval process for Change documents.

Line(s) of Business (LOB)

The parts of an organization that function as separate business entities when viewed from the highest level

Load Balancing

A method of using clusters of systems to re-distribute the workload so that no one system suffers degraded performance due to transaction volume

Maintainability

The ability of an IT Infrastructure component to be retained in, or restored to, an operational state

Masking

A system’s ability to hide or “mask” faults from the user.

Mean Time Between Failure (MTBF)

The average time a device will function before failing, measured in machine operational hours. The more components that are added to a serial system, the lower the resultant MTBF (i.e. the higher the failure rate) will be. This premise is the basis for seeking alternative design approaches, such as fault tolerance, fault resilience, or redundancy, which try to keep a system operational even in the event of a hardware failure, thereby satisfying the customer’s definition of reliability.
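In a serial system the component failure rates add, which is why each added component lowers the overall MTBF:

\[ \frac{1}{\text{MTBF}_{\text{system}}} = \sum_{i=1}^{n} \frac{1}{\text{MTBF}_i} \]

For example, two serial components each with an MTBF of 100,000 hours give a system MTBF of 50,000 hours.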

Mean Time to Repair (MTTR)

The average elapsed time from the occurrence of an Incident to its resolution.
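Taken together, MTBF and MTTR determine the steady-state availability of a component:

\[ \text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}} \]

For example, an MTBF of 1,000 hours and an MTTR of 2 hours give 1,000 / 1,002 ≈ 99.8% availability.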

Metric

A data element that indicates something about the behavior of a system, subsystem, application or process.

Monitoring Tools

Software or hardware that retrieves data about the state of the underlying components driving a service. The data is used to issue alerts about degrading performance but can also be stored in a database and later retrieved and put into consolidated reports.

Online Analytical Processing (OLAP)

A category of database software which provides an interface such that users can transform or limit raw data according to user-defined or pre-defined functions, and quickly and interactively examine the results in various dimensions of the data.

Partitioning

A system is separated or divided into separate systems so that a fault condition in one does not affect the other (see also Isolation). This technique is more effective in limiting the scope of outages than in preventing them.

Performance

A measure of the responsiveness of the application to interactive users and the time required to complete a transaction.

Probe

A standalone hardware device containing RMON and RMON II agents along with packet parsing and filtering engines.

Problem

Unknown underlying cause of one or more incidents.

Recoverability

The ability to resume processing after unplanned outages as rapidly as possible

Redundancy

A technique whereby a system component is duplicated and either of the two may be used at any one time. Since two identical components are online, the system can continue to function when one of them fails.
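The benefit of redundancy can be quantified: if each of two independently failing duplicates has availability A, service is lost only when both fail:

\[ A_{\text{pair}} = 1 - (1 - A)^2 \]

For example, duplicating a component that is 99% available yields 1 − (0.01)^2 = 99.99% availability for the pair, assuming independent failures.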

Reliability

Freedom from operational failure, often expressed as Mean Time Between Failure (MTBF). The inherent reliability of a system is a function of the sum of the unreliabilities (failure rates) of all components in the system.

RMON

Short for remote monitoring, a network management protocol that allows network information to be gathered at a single workstation.

Role

A set of responsibilities, activities and authorizations.

Security

The Confidentiality, Integrity and Availability (CIA) of the data associated with a service; an aspect of overall Availability.

Serviceability

Contractual arrangements made with Third Party IT Service providers (e.g. Facilities Management) to assure the Availability, reliability and maintainability of IT Services and components under their care.

Service

An intangible set of benefits provided by one party to another. Services are created by performing certain activities. Services are usually created by a large number of activities that create the benefits together. Each activity contributes directly or indirectly to the set of benefits. Activities that directly create a benefit are called service delivery activities. Activities that indirectly contribute to the delivery of services are called service support activities.

Service Chain

The provision of a service which involves multiple participants who deliver separate components of the service in a logical order

Service Availability

The capability to successfully complete an entire service or business transaction – defined within standard hours of operation

Service Catalog

A description of the services offered by an organization which can be used to order and describe a service in full.

Service Incident/problem

A fault in the infrastructure attributable to an incorrect expression of a service

Service Level Agreement

A written agreement between a service provider and Customer(s) that documents agreed services and the levels at which they are provided at various costs.

Service Level Management

Disciplined, proactive methodology and procedures used to ensure that adequate levels of service are delivered to supported IT users in accordance with business priorities and at acceptable costs.

Simulated (Synthetic) Transaction

An artificially created transaction that mimics a standard transaction over an IP network, designed to give consistent readings on response time and availability.

Single Point of Failure (SPOF)

Any component — CPU, disk drive, cable, communications stack, device driver — whose failure can render the entire system unusable.

System

An integrated composite that consists of one or more of the processes, hardware, software, facilities and people, that provides a capability to satisfy a stated need or objective.

System Outage Analysis (SOA)

A technique designed to provide a structured approach to identify end-to-end Availability improvement opportunities that deliver benefits to the User. Many of the activities involved in SOA are closely aligned with those of Problem Management

User

The consumer of a service – the person or group who accesses and uses the service – to be distinguished from the Customer, who pays for the service (though they may be the same person).

Vital Business Function (VBF)

The business critical elements of the business process supported by an IT Service

[To top of Page]

Availability/Capacity Management Maturity Levels

Maturity Level (COBIT) – Description
0 / Non-existent – Management has not recognized that key business processes may require high levels of performance from IT or that the overall business need for IT services may exceed capacity. There is no capacity planning process in place.
1 / Ad Hoc (ignored procedures) – Performance and capacity management is reactive and sporadic. Users often have to devise work-arounds for performance and capacity constraints. There is very little appreciation of the IT service needs by the owners of the business processes. IT management is aware of the need for performance and capacity management, but the action taken is usually reactive or incomplete. The planning process is informal.
2 / Repeatable but Intuitive – Business management is aware of the impact of not managing performance and capacity. For critical areas, performance needs are generally catered for, based on assessment of individual systems and the knowledge of support and project teams. Some individual tools may be used to diagnose performance and capacity problems, but the consistency of results is dependent on the expertise of key individuals. There is no overall assessment of the IT infrastructure's performance capability or consideration of peak and worst-case loading situations. Availability problems are likely to occur in an unexpected and random fashion and take considerable time to diagnose and correct.
3 / Defined Process – Performance and capacity requirements are defined as steps to be addressed at all stages of the systems acquisition and deployment methodology. There are defined service level requirements and metrics that can be used to measure operational performance. It is possible to model and forecast future performance requirements. Reports can be produced giving performance statistics. Problems are still likely to occur and be time consuming to correct. Despite published service levels, end users will occasionally feel sceptical about the service capability.
4 / Managed and Measurable – Processes and tools are available to measure system usage and compare it to defined service levels. Up-to-date information is available, giving standardized performance statistics and alerting incidents such as insufficient capacity or throughput. Incidents caused by capacity and performance failures are dealt with according to defined and standardized procedures. Automated tools are used to monitor specific resources such as disk storage, network servers and network gateways. There is some attempt to report performance statistics in business process terms, so that end users can understand IT service levels. Users feel generally satisfied with current service capability and are demanding new and improved availability levels.
5 / Optimized – The performance and capacity plans are fully synchronized with the business forecasts and the operational plans and objectives. The IT infrastructure is subject to regular reviews to ensure that optimum capacity is achieved at the lowest possible cost. Advances in technology are closely monitored to take advantage of improved product performance. The metrics for measuring IT performance have been fine-tuned to focus on key areas and are translated into KGIs, KPIs and CSFs for all critical business processes. Tools for monitoring critical IT resources have been standardized, wherever possible, across platforms and linked to a single organization-wide incident management system. Monitoring tools increasingly can detect and automatically correct performance problems, e.g., allocating increased storage space or re-routing network traffic. Trends are detected showing imminent performance problems caused by increased business volumes, enabling planning and avoidance of unexpected incidents. Users expect 24x7x365 availability.

[To top of Page]

Toolset Considerations

Monitoring Agents
There are many systems on the market today that claim to provide some level of "service level management." Most of these systems confine themselves to monitoring network infrastructure equipment. While this is a valuable capability, it does not go far enough to be considered true SLM, nor does it lend itself to meaningful customer service level reporting.

Some types of problem tracking or call management systems market themselves as having service level management functionality. In some ways, this is correct: many businesses use standard help desk tools to report on network availability, overall system availability, and customer service levels based on the types of trouble tickets received. This kind of "service level management" is, by nature, purely reactive, relying as it does on after-the-fact information capture of service failures. These tools also provide no mechanism for tying performance to business goals.

Still other types of systems come a little closer to a mature implementation of SLM by using test transactions to simulate customer/end user activity. Essentially, these systems send "synthetic transactions" through standard customer systems, timing them and monitoring for difficulties. Obviously, these systems confine themselves to network and application service level management, although they can provide some measure of business impact by examining employee productivity (or the lack of it, due to system problems) or the customer's experience of the system.
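For illustration, a minimal synthetic-transaction probe might look like the following Python sketch. The target URL and timeout are assumptions for the example; real products add multi-step transaction scripting, scheduling and central reporting of the results:

    import time
    import urllib.request
    import urllib.error

    # Hypothetical endpoint standing in for the customer-facing system.
    TARGET_URL = "http://example.com/health"
    TIMEOUT_SECONDS = 10

    def run_synthetic_transaction(url=TARGET_URL):
        """Time one synthetic transaction and record whether it succeeded."""
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=TIMEOUT_SECONDS) as response:
                available = 200 <= response.status < 300
        except (urllib.error.URLError, OSError):
            available = False
        elapsed = time.monotonic() - start
        return {"timestamp": time.time(),
                "available": available,
                "response_seconds": round(elapsed, 3)}

    if __name__ == "__main__":
        # A real monitor would run this on a schedule and store each reading
        # (e.g. in the AMDB) for response-time and availability reporting.
        print(run_synthetic_transaction())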

Web monitoring systems are becoming more prevalent. These systems examine the service experienced by web site users, and may also use test transactions to measure web site access events. There are also service level management systems that focus on specific technologies, such as VOIP, or on particular industries (telecom, ISPs).

Analytic Tools
The method used to link to new data sources should also be demonstrably simple. The administrator of a new system should not have to spend all of their time attempting to match file structures or running conversion routines in order to acquire a new data source.

Reporting must be flexible and easily customizable. It should be easy to set alarms that indicate when service levels are in danger of being breached, and these alarms should be easily configurable to alert via e-mail, pager, cellular phone, etc.

Finally, the system should be smart enough to adjust objectives based on changing conditions. An increase in the number of employees providing a particular service should be reflected in all the metrics associated with that service - without requiring administrator or business unit intervention.

Ideally, the system needs to be able to take data from systems or files associated with services, contractual agreements with outsourcers or customers, resources, finance and budgets, and processes (such as change management or knowledge management) and relate it into unified information that speaks directly to the organization's overall business goals and vision. In this way, true Service Level Management is achieved.

Prominent amongst available analytic tools and approaches are:

[To top of Page]

System Recovery Template

Executive Overview
Describe the purpose, scope and organization of the Availability Recovery document.

Scope
Not all IT Services may initially be included within the Availability Recovery document. Use this section to outline what will be included and the timetable for bringing other services into scope.

Scope for the Recovery document may be determined by the business, therefore covering only a select few of the IT Services provided by the IT department that are seen as critical to the support of the business processes.

Note that this document is distinct from the IT Service Continuity recovery documentation.

Include in the scope the difference between IT Service Continuity and Availability. This will depend on how the service is defined in a Service Catalogue.

To improve recovery, perform a Component Failure Impact Analysis (CFIA) on the system.
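A CFIA is essentially a matrix of configuration items against the services they support; where a CI's failure would take a service down, the cell is marked. The Python sketch below illustrates the idea, using CI and service names assumed for the example (they echo the sample tables later in this template):

    # Component Failure Impact Analysis (CFIA): map each CI to the services
    # whose availability is lost ("X") when that CI fails.
    # Illustrative data only - in practice this comes from the CMDB.
    cfia = {
        "CISCO-001 (router)": {"Service A": "X", "Email": "X"},
        "CISCO-002 (router)": {"Service A": "X", "Email": ""},
        "EMERO (server)":     {"Service A": "X", "Email": ""},
    }
    services = ["Service A", "Email"]

    # Print the matrix: rows are CIs, columns are services.
    print(f"{'CI':<22}" + "".join(f"{s:<12}" for s in services))
    for ci, impacts in cfia.items():
        print(f"{ci:<22}" + "".join(f"{impacts.get(s, ''):<12}" for s in services))

    # Rank CIs by how many services their failure would impact; the
    # highest-count CIs are the first candidates for redundancy.
    for ci, impacts in sorted(cfia.items(),
                              key=lambda kv: -sum(v == "X" for v in kv[1].values())):
        hits = sum(v == "X" for v in impacts.values())
        print(f"{ci}: failure impacts {hits} service(s)")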

Service Availability Summary
This section of the document provides a summary of all the services listed in the document and the pertinent information regarding recovery of that service. It should be used as a check list.

IT Service | Owner | Business Process | Business Owners | SLA # / Service Catalogue Reference | Probability of Failure | Recovery Time | Recovery Procedure | Back Up Available and Tested | Data Capture
Service A | J. Ned | Billing | T. Smith | SLA001 | | | << List appropriate procedures. NB. This will require Incident and Problem Mgt input >> | Yes |
Email | A. Boon | Communication | R. Jones | SLA234 | | | | |
SAP | C. Jones | Invoice and Payroll | P. Boon | SLA123 | | | | |
Service B | L. Smith | Marketing | R. Reagan | SLA009 | | | | |
Service C | R. Smith | Manufacturing | R. Smith | SLA007 | | | | |

Service A [One section for each service]

Service Description
In this section briefly describe the service.

Probability of Loss
In this section describe, for this service, the probability of a disruption and its effect on the business. For example, will the loss of service invoke a contract that has set costs associated with it? If the service is lost, can we expect to lose customers, clients or market share? Define each form that the loss of the service can take.

Service Degradation
Use this section to specify, for this Service/application, how quickly a loss of the service is likely to degrade overall performance. That is, provide a score from 1 (low) to 10 (highest) that indicates how rapidly the service loss will grow in severity.

Escalation Score (1 = slow/barely noticeable, 10 = rapid pace of overall deterioration) | Resulting Business Impact
9 | Complete loss of service. Company reputation under threat.
1 | Minor degradation. Customers unaware.

Escalation Procedures
Use this section to detail all escalation procedures. In the event of a failure in service it is important to provide a concise list of personnel that will need to be contacted. This will help reduce the service disruption time.

Priority | Hierarchical: Name / Dept / Number | Functional: Name / Dept / Number | Business: Name / Dept / Number
1 | | |
2 | | |
3 | | |
4 | | |
5 | | |
9 | | |

Device Dependencies
In this section list the devices that are components of the service. Understanding these dependencies helps pinpoint the area of failure, decreasing the time to respond and recover.

IT Components (Configuration Items (CI))

CI # | Serial # | CI Name | Type | Sub-Type | Criticality
SER345 | 15434563 | EMERO | Hardware | Server | High
RT5700 | 54444443 | CISCO-002 | Hardware | Router | High
RT4567 | 76547457 | CISCO-001 | Hardware | Router | High
MS001 | N/A | MS Office | Software | Microsoft | Low

Business Needs
Use this section to describe any and all information that needs to be supplied to the business to help them manage the impact of failure on their processes. This will also help in setting the correct expectations and managing any issues that may arise due to the failure.

IT Needs and Resource Factors
Use this section to specify for this Service/application the combination of the complexity of facilities and the level of skills required in the people that will permit this service to stay operating, in the event of a failure.

List all necessary involvement with third party vendors as well.

Recovery Procedure
In this section you should list the recovery procedures for the above listed Configuration Items. We have added a simple procedure template as well.

IT Components (Configuration Items (CI))

CI # | CI Name | Type | Sub-Type | Criticality | Recovery Procedure
SER345 | EMERO | Hardware | Server | High |
RT5700 | CISCO-002 | Hardware | Router | High |
RT4567 | CISCO-001 | Hardware | Router | High |
MS001 | MS Office | Software | Microsoft | Low |

PROCEDURE TEMPLATE

Step | Task / Activity | Timing / Dependency | Expected Duration
1 | | |
2 | | |
3 | | |


Note: The expected duration column allows measurements to be taken so as to identify opportunities for improving recovery times.

Data Capture
In this section detail the required level of diagnostics that needs to be captured in the event of failure. This will include items such as server logs, application logs or error files, system management tool diagnostics, etc.

This information will be used in the Problem Management process to help identify the underlying cause and provide an avenue for preventing the failure from recurring.

Conclusion
Remember that the impact of loss will change over time. Therefore this document should be maintained on a regular basis, timed to coincide with Service Level Management reviews of the Service Catalog or Service Level Agreements.

Appendices
Include any applicable appendices.

Terminology
See the Terminology section in the Appendix. [To top of Page]

System Outage Analysis (SOA) Template

Executive Overview
Describe the purpose, scope and organization of the document.

Scope
This document will outline the pertinent areas of consideration when performing a Service Outage Analysis.

A Service Outage Analysis will generally be performed following an outage of an IT Service, and covers the components involved in delivering that service.

In this section detail the scope of the SOA.

Planning
In this section include the plan for the SOA for the service. This is much like performing a Post Implementation Review.

Record in this section of the document the following:

Data Sources
In this section detail a list of possible data sources needed for the SOA. Potential sources of information are:

Resources
During the SOA you may require appropriate resources to complete the assignment. Potential resources are:

Schedules
Include in this section any appropriate Project Management plans. It is important to have a clearly defined schedule, distributed amongst the team, to help drive the SOA assignment.

Within this section list the following:

Hypotheses
The next step is to list all hypotheses regarding the Service Outage. This can be done in the following table:

Hypotheses | Probability | Investigative Area
List your hypotheses here | What is the probability of it being true? | Where will you look to get the information to prove it right or wrong?
 | |
 | |

Data Analysis
From the above table, gather the necessary data from the selected sources. Data analysis techniques can vary dramatically, and it is not the intent of this document to provide such techniques.

Create a table for each data source to capture the necessary information so that appropriate analysis can take place.

Provide a summary of the data in the following table:

Hypotheses | Data Source | Data | Supportive
The hypotheses can be re-listed | What was the data source | What data was collected | Did it support the hypothesis?
 | | |
 | | |

Interviews
Interviews are a key aspect of the SOA. They can provide better insight into the outage and the processes around it.

The "human factor" can provide more meaningful input than straight data. They will provide business and user perspective of the service outage.

Interviewing staff makes it easier to determine where the real issues have occurred within the user community. The resulting solution may be quite different from where the technical data points.

Interview the Problem Management team as well.

Findings and Recommendations
From your hypotheses and interviews, you should be able to provide a list of findings and recommend solutions to help improve end-to-end service availability. Recommendations can be captured in the table below:

Priority | Hypotheses | Findings | Recommendations
This column is used for prioritizing the solutions | List the hypotheses | List any findings, supportive or not, for the hypotheses | Provide the recommendations for improving the service availability.
 | | |

Appendices
List any appendices needed in conjunction with this document.

Terminology
See the Terminology section in the Appendix. [To top of Page]


