20 Key High Availability Practices
  1. Spend money on availability - the return is worth it.
  2. Assume nothing - achieving production-caliber end-to-end system availability requires effort directed at testing, integration, and application-level assessments.
  3. Remove single points of failure - eliminate any point in the availability chain whose failure would cause an outage.
  4. Enforce security - data that lacks integrity is worse than data that is unavailable, since the organization may act on it erroneously.
  5. Consolidate servers - fewer distinct configurations reduce the complexity of the infrastructure, which means fewer machines that require backups and reboots and, overall, fewer things that can go wrong.
  6. Watch your speed - keep tabs on performance. At some point users become so frustrated that the system is, for all intents and purposes, unavailable. Declining performance may also augur pending availability problems.
  7. Enforce change control - changes are a major source of failure since they introduce unknowns and risks into the environment.
  8. Ensure proper documentation - documentation provides audit trails of work that has been completed and ensures that procedures will last beyond the person who developed them.
  9. Use service level agreements - agreements reflect considerable negotiation with the business line as to what levels of availability are acceptable. They also spell out the types of failures and outages and how long the system can be down as a result of each.
  10. Plan for recovery in the event of failure - documented recovery plans should be developed, reviewed by management, and maintained in process repositories.
  11. Test everything - develop procedures outlining the type of testing needed for each kind of change introduced. Testing should always be based on well-developed test plans.
  12. Separate environments - keep production, pre-production, quality assurance, development, laboratory, and disaster recovery environments physically isolated from each other.
  13. Maintain documented lessons learned - in particular, account for every incident and every change that results in an incident. Ensure that the root cause is determined for all incidents affecting mission-critical applications.
  14. Design for growth - build scalable systems whose component capacities can be increased seamlessly without overly encumbering the organization's ability to move to newer technologies when warranted.
  15. Choose mature software - prefer software that has been thoroughly tested, particularly with reference to the architectural elements within your infrastructure.
  16. Choose mature, reliable hardware - this includes devices with high MTBFs (mean time between failures) and the logistical ability to manage spare parts.
  17. Reuse configurations - by limiting the number of different models, fewer elements will need to be revised, and overall fewer configurations will need to be running in the enterprise.
  18. Exploit external resources - things which have been done before are more likely to operate error-free than newer technologies. Where newer technologies cannot be avoided, ensure mechanisms for a proper transfer of knowledge.
  19. Ensure that software is used only for the purposes for which it was intended - adapting a software product for uses outside its specified purpose invites trouble.
  20. Simplicity is desirable - all else being equal, the fewer parts something has, the fewer things can go wrong. Software products that do many things beyond what you require create additional points of failure without adding business value.
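
Several of the practices above (single points of failure, service level agreements, high-MTBF hardware, fewer parts) rest on simple availability arithmetic. The sketch below is one common way to put numbers on them; it is an illustration, not part of the original list. It assumes the standard steady-state model, availability = MTBF / (MTBF + MTTR), and assumes components fail independently. MTTR (mean time to repair), the function names, and all figures are hypothetical and introduced only for this example.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability of one component: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def series(availabilities) -> float:
    """Availability of a chain in which every component must be up.

    Each component here is a single point of failure, so the
    availabilities multiply (independence is assumed).
    """
    return prod(availabilities)


def redundant(a: float, copies: int = 2) -> float:
    """Availability of independent redundant copies of one component.

    The group is down only when every copy is down at the same time.
    """
    return 1 - (1 - a) ** copies


def downtime_hours_per_year(a: float) -> float:
    """Expected yearly downtime implied by an availability figure."""
    return (1 - a) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical component: fails every 9990 hours, takes 10 hours to repair.
    a = availability(9990, 10)
    chain = series([a, a, a])  # three such components, all required
    pair = redundant(a)        # the same component, duplicated
    for label, value in [("single", a), ("chain of 3", chain), ("redundant pair", pair)]:
        print(f"{label}: {value:.6f} -> {downtime_hours_per_year(value):.2f} h/year down")
```

Under these assumed figures, one component is 99.9% available (about 8.8 hours of downtime per year). Chaining three of them drops availability to roughly 99.7% (about 26 hours per year), which is the arithmetic behind "the fewer parts, the fewer things can go wrong". Duplicating the component raises availability to about 99.9999%, which is why removing a single point of failure usually repays its cost. Real failures are rarely fully independent, so treat the redundant figure as an upper bound.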