20 Key High Availability Practices

Spend money on availability - the return is worth it
Assume nothing - achieving production-caliber end-to-end system availability requires effort directed at testing, integration, and application-level assessments.
Remove single points of failure - eliminate any point in the availability chain where the failure of one device causes an outage.
Enforce security - data that lacks integrity is worse than data that is unavailable, since the organization may act on it erroneously.
Consolidate servers - fewer distinct configurations reduce the complexity of the infrastructure: fewer machines require backups and reboots and, overall, fewer things can go wrong.
Watch your speed - keep tabs on performance. At some point users become so frustrated that the system is, for all intents and purposes, unavailable. Declining performance may also signal pending availability problems.
Enforce change control - changes are a major source of failure since they introduce unknowns and risks into the environment.
Ensure proper documentation - documentation provides an audit trail of completed work and ensures that procedures outlast the person who developed them.
Use Service Level Agreements - agreements reflect considerable negotiation with the business line as to what levels of availability are acceptable. They also spell out the types of failures and outages and how long the system can be down as a result of each.
Plan for recovery in the event of failure - documented recovery plans should be developed, reviewed by management, and maintained in process repositories.
Test everything - develop procedures outlining the type of testing needed for each kind of change introduced. Testing should always be based on well-developed test plans.
Separate environments - keep production, pre-production, quality assurance, development, laboratory, and disaster recovery environments physically isolated from each other.
Maintain documented lessons learned - in particular, account for every incident and every change that results in an incident. Ensure that the root cause is determined for all incidents affecting mission-critical applications.
Design for growth - build scalable systems whose component capacities can increase seamlessly without unduly hindering the organization's ability to move to newer technologies when warranted.
Choose mature software - select software that has been thoroughly tested, particularly with reference to the architectural elements within your infrastructure.
Choose mature, reliable hardware - this includes devices with high MTBFs (mean time between failures) and the logistical ability to manage spare parts.
Reuse configurations - by limiting the number of different models, fewer elements will need to be revised and, overall, fewer configurations will be running in the enterprise.
Exploit external resources - things that have been done before are more likely to operate error-free than newer technologies. Where newer technologies cannot be avoided, ensure mechanisms for a proper transfer of knowledge.
Ensure that software is used only for its intended purposes - adapting a software product for uses outside its specified purpose invites trouble.
Simplicity is desirable - all else being equal, the fewer parts something has, the fewer things can go wrong. Software products that do many things beyond what you require create additional points of failure without adding business value.
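
Several of the practices above (removing single points of failure, choosing hardware with high MTBFs, and agreeing downtime limits in Service Level Agreements) rest on simple availability arithmetic. The following is a minimal Python sketch of that arithmetic, not a method prescribed by this list. It assumes steady-state operation and independent failures, and it introduces MTTR (mean time to repair), which the list does not mention. The function names and all figures are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: every figure below is a made-up example, not a measurement.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def serial(*parts: float) -> float:
    """Availability of a chain in which every part must work (each is a single point of failure)."""
    result = 1.0
    for a in parts:
        result *= a
    return result


def redundant(a: float, copies: int) -> float:
    """Availability of `copies` independent, identical parts when any one of them suffices."""
    return 1.0 - (1.0 - a) ** copies


def annual_downtime_hours(a: float) -> float:
    """Expected downtime per year implied by an availability figure."""
    return (1.0 - a) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical device: fails on average every 10,000 hours, takes 8 hours to repair.
    server = availability(mtbf_hours=10_000, mttr_hours=8)
    chain = serial(server, server, server)  # three devices, all required
    paired = serial(*(redundant(server, 2) for _ in range(3)))  # each device duplicated
    for label, a in [
        ("single device", server),
        ("three-device chain", chain),
        ("chain with each device duplicated", paired),
    ]:
        print(f"{label}: availability {a:.6f}, about {annual_downtime_hours(a):.2f} h/year down")
```

Under these assumptions, chaining components multiplies their availabilities, so every added device in the chain lowers the total, while duplicating a device raises it. Comparing the resulting annual downtime with the figure negotiated in a Service Level Agreement is one way to judge whether spending on redundancy is justified. Real failures are often correlated (shared power, shared software faults), so results like these are best read as optimistic estimates.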