The Continuous Demand for Continuous Availability
BYTRON
Category: Continuous IT Availability | 18/01/2006 - 14:33:21
Today, availability is fast becoming a mandatory requirement for many companies' business-critical applications: that is, applications whose failure brings the whole business to a halt, or worse.
A bank no longer has manual procedures to fall back on when an IT system fails. Stock exchanges rely totally on their online trading applications. Emergency services are only as available as the IT systems that support them. A pharmaceutical manufacturing factory will not just stop but will have to junk at least its entire day's output if even the smallest outage occurs. And the consequences of computer failure in air traffic systems are well documented.
These are not inconveniences caused by lack of IT availability, but absolute disasters that affect the organisation and its customers. Almost every enterprise has at least one mission-critical application where an outage would be catastrophic.
To mitigate the risk of disaster, more and more lines of business are demanding very high availability targets backed by stringent service level agreements (SLAs) from their IT departments.
This has driven the IT industry to create a multitude of different availability solutions to meet these needs, largely revolving around specific technology platforms. Unfortunately, the rule of thumb has become: the higher the degree of availability, the more proprietary the solution, the more complex it is to run, and the more expensive it is to deploy and maintain.
This article looks at some of the technologies that deliver high availability solutions and how to evaluate them. As a starting point, IDC (the premier global market intelligence and advisory firm in the information technology and telecommunications industries) defines availability across four levels:
- AL1 – Contains some fault tolerant components. In the event of a system failure, work simply stops.
- AL2 – Typically found in warm stand-by clusters. When a system fails, user connection is impacted and the transaction is lost.
- AL3 – The user connection will be maintained in the event of a failure; however, the transaction will be lost. This is commonly found in automatic failover implementations.
- AL4 – The highest of all implementations, whereby failure is transparent to the user and results in no transaction loss. Such systems have 100% component replication, a feature of fault tolerant servers.
Availability is usually expressed as the percentage of time a system is available over a given period, typically a year. Each percentage implies a maximum amount of downtime per year, approximately as follows:
- 99% - 87.6 hours
- 99.9% - 8.76 hours
- 99.99% - 52.5 minutes
- 99.999% - 5.25 minutes
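The figures above follow from simple arithmetic on the number of hours in a year. The sketch below shows one way to reproduce them; the function name and the choice of a 365-day year (8,760 hours) are assumptions made for illustration, not something specified by IDC or the article.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; assumes a non-leap year


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the yearly downtime, in hours, implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        hours = annual_downtime_hours(pct)
        print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.1f} minutes)")
```

Under these assumptions 99.99% works out to roughly 52.6 minutes a year, which matches the list above to within rounding.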
A few minutes of downtime a year may not sound like much, but to put these figures into context, a 10-minute outage on an air traffic control system can create chaos not only in the air but also on the ground, with knock-on effects for airports, airlines and, not least, the fare-paying passenger.
To drive availability to the highest level, fault-tolerant or clustered systems are used to run the critical applications, but the real issue is preventing downtime in the first place.
Continuous availability defined
From the outside, a continuously available network closely resembles a conventional network. It consists of servers running applications, databases and networking software.
The servers are linked with storage arrays by router-based networks running protocols over hard-wired and wireless connections. In a manufacturing environment, interfaces with plant floor systems pull in data from individual machines.
The primary difference between a conventional and a continuously available network is their approach to errors. Continuous availability’s focus is on error prevention.
This differs from the prevailing attitude in corporate networking, where the focus is on recovery from errors and failures. Recovery-oriented solutions assume downtime, even if it’s only a few minutes during failover from one server to another.
Continuously available systems are built from the ground up around redundancy and error detection that prevents failure. The entire system has to be designed and configured for the purpose and to scale across the entire network environment. This is called fault tolerant computing.
Problems often arise in organisations that have tried to upgrade availability after the fact, often through the use of clusters. This additive approach focuses on individual points of failure and does not provide a holistic solution for continuous availability.
For example, a crash that occurs outside the server, in a network interface card or in an application running on the server, will still cause an outage.
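A rough calculation shows why hardening a single element is not enough. If every component in a chain must be working for the service to be up, the overall availability is the product of the individual availabilities. The sketch below assumes independent failures, and all of the figures in it are invented for illustration; they are not measurements from any real system.

```python
from math import prod


def series_availability(components):
    """Availability of a chain in which every component must be up."""
    return prod(components)


def redundant_availability(availability, copies):
    """Availability of independent replicas where any one copy is enough."""
    return 1.0 - (1.0 - availability) ** copies


if __name__ == "__main__":
    # Illustrative, made-up availabilities for three elements of one service.
    server, nic, application = 0.999, 0.999, 0.995

    standalone = series_availability([server, nic, application])
    # Cluster only the server: the NIC and the application are unchanged.
    clustered = series_availability(
        [redundant_availability(server, 2), nic, application]
    )

    print(f"standalone server chain: {standalone:.5f}")
    print(f"clustered server chain:  {clustered:.5f}")
```

With these assumed numbers, duplicating the server improves the chain only slightly, because the unprotected network card and application still cap the result.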
Additive approaches such as server clusters aren’t going to solve those problems, and neither will any other individual network element. Hardening a network means considering the hardware and software as a whole. In the continuous availability lexicon, software doesn’t run “on” hardware; the two run together.
All of the potential failure points between the two must be discovered and hardened against crashes. The components themselves must be of high quality, with built-in management and monitoring functions so IT staff can anticipate failures and head them off. Hardware systems should be fully redundant to guard against downtime.
Just as a natural ecosystem is affected when a new species is introduced, so is a continuously available ecosystem. Application upgrades, patches and routine maintenance can cause slowdowns and crashes, as can simply plugging an unauthorized laptop into the network.
Companies must factor together all of these elements – hardware, software, outside influences – in the design of a continuously available network.
High availability through Clustering
Clustering technology is considered to be the most widely used option for achieving higher server availability. A clustered solution is typically a configuration of two or more servers that interoperate with each other to increase availability.
Continuously available hardware, by contrast, has ‘fault tolerance’ built into the hardware itself, transparent to the solution and the user.
There are many benefits that come from implementing a clustered solution. Not only do you achieve higher levels of availability, but you can also achieve higher levels of scalability using tools such as load balancing. That said, there are many problems associated with clusters as well.
To start with, you need somebody who understands clustering in order to implement and maintain a cluster. Such skills may cost companies up to twice as much as an IT specialist without clustering skills. It has often been said that clusters represent the best example yet seen of ‘expert friendly technology’.
For an application to be supported by a clustered solution, it must also be aware that it is running in a clustered environment. Licensing costs for the application may also have to be considered: you often find yourself paying for two licences instead of one.
Overall, clusters bring a lot of complexity to the business, which often spells more cost, both initially and over the lifetime of the solution. While clustering is capable of delivering excellent levels of system availability, published data and the experience of many companies indicate that clustering results vary widely.
Creating a successful high availability environment using clustering requires a combination of careful planning, best-of-breed hardware components, mission-critical level service contracts, and disciplined testing and change management processes on the part of the internal IT staff.
Companies that are not willing to recognize the costs involved, including money, time and resources, will sooner or later pay the price when the cluster fails to perform as intended.
A closer look at fault tolerant computing
Fault tolerant technology has long been used, and still is, wherever downtime is not an option. In the past these systems were proprietary and came with a price tag few could afford.
But times are changing rapidly, and fault tolerant solutions built on industry standard platforms are now becoming available at affordable price points for enterprises of all sizes. For example, companies such as Stratus have recently introduced fault tolerant systems based on the Intel architecture and supporting the Windows 2000/2003 operating system.
Companies previously forced to compromise availability can now enjoy the benefits of continuous availability through fault tolerance.
So how does this technology work, and what added benefits does it deliver over mainstream clusters?
Instead of the multiple boxes approach used by clusters, fault tolerant technology looks to eliminate single points of failure using replicated components that continue uninterrupted processing even in the event of a component malfunction. Hardware faults are handled automatically by the system, without failover delay or data loss.
Using lockstep technology, server systems are able to maintain multiple CPU-memory units in precise synchronization executing the same instructions at exactly the same clock cycle. Lockstep processing ensures that any errors, even transient errors, are detected and that the system can survive any CPU-memory unit error without interrupting processing and without losing any data or state.
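The idea of comparing replicated units at every step can be illustrated in software. The sketch below is only an analogy: real lockstep runs in hardware at the clock-cycle level, and the three-replica majority vote, the function names and the toy instruction set are all assumptions made for the example rather than a description of any vendor's design.

```python
from collections import Counter


class DivergenceError(Exception):
    """Raised when the replicas disagree and no majority result exists."""


def healthy_unit(instruction, operand):
    """A toy CPU-memory unit that understands two instructions."""
    if instruction == "double":
        return operand * 2
    if instruction == "increment":
        return operand + 1
    raise ValueError(f"unknown instruction: {instruction}")


def faulty_unit(instruction, operand):
    """Simulates a transient error by flipping the lowest bit of the result."""
    return healthy_unit(instruction, operand) ^ 1


def lockstep_step(replicas, instruction, operand):
    """Run one instruction on every replica and compare the results.

    Returns the majority result and the indexes of any replicas that
    disagreed, so they could be taken out of service.
    """
    results = [unit(instruction, operand) for unit in replicas]
    winner, votes = Counter(results).most_common(1)[0]
    if votes <= len(results) // 2:
        raise DivergenceError(f"no majority among results {results}")
    suspects = [i for i, r in enumerate(results) if r != winner]
    return winner, suspects


if __name__ == "__main__":
    units = [healthy_unit, faulty_unit, healthy_unit]
    result, suspects = lockstep_step(units, "double", 21)
    print(f"result={result}, suspect replicas={suspects}")
```

In this sketch, two replicas are enough to detect a disagreement, but a third is needed to decide which result to keep and carry on without interruption.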
While many servers now offer duplicated power supplies, fans and disk drives, fault-tolerant systems offer extra protection for core system components that include motherboards, processors, memory, I/O buses, and I/O adapters.
Another advantage of this approach is that the server presents a single-system view and runs a single copy of all software, which reduces software licensing costs and simplifies administration as compared with multi-node cluster alternatives.
The fault-tolerant I/O subsystem is physically separate from the CPU-memory subsystem. Hardware logic, in the form of custom ASICs, acts as a PCI bridge between the CPU and I/O, and provides the core error detection, fault isolation, and synchronization logic for the lockstep architecture.
These ASICs are the Stratus North PCI (SNP) ASIC and the Stratus South PCI (SSP) ASIC. The SNP contains the primary PCI interfaces, interrupt control functions, and transaction ordering logic.
The SSP contains the voting logic, secondary PCI interfaces, and error registers. The ASICs use a passive bus, which the second-generation ftServer design implements in the form of a backplane, to connect the replicated CPU and I/O modules within the server.
Fault-tolerant I/O is implemented through the use of replicated PCI buses, replicated I/O adapters, and replicated devices. Base configurations should include two independent PCI buses, with the option to configure additional buses.
All critical PCI adapters are duplicated as well: SCSI, Ethernet, remote management, and Fibre Channel. Internal and external SCSI disk storage is mirrored (RAID 1), connected via two independent SCSI buses.
Multiple paths are therefore available to any logical I/O operation, including both internal and external storage operations. Any I/O operation failure will result in a retry using an alternate path that ensures successful completion of the I/O operation.
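The retry-on-an-alternate-path behaviour can be sketched in a few lines. This is a simplified software model, not the actual driver logic: the path names, the `IOPathError` exception and the `make_path` helper are hypothetical and exist only to make the example runnable.

```python
class IOPathError(Exception):
    """Raised when an I/O operation fails on a particular path."""


def make_path(healthy, data):
    """Build a toy read function for one path to mirrored storage."""
    def read(block):
        if not healthy:
            raise IOPathError("path unavailable")
        return data[block]
    return read


def perform_io(operation, paths):
    """Try the operation on each path in turn until one succeeds.

    `paths` is a list of (name, path) pairs. Returns the result and the
    name of the path that served it; raises only if every path fails.
    """
    failures = []
    for name, path in paths:
        try:
            return operation(path), name
        except IOPathError as exc:
            failures.append(f"{name}: {exc}")
    raise IOPathError("all paths failed: " + "; ".join(failures))


if __name__ == "__main__":
    mirror = {0: b"boot", 1: b"data"}  # the same blocks on both mirrored disks
    paths = [
        ("scsi-bus-a", make_path(False, mirror)),  # simulated failed path
        ("scsi-bus-b", make_path(True, mirror)),
    ]
    result, used = perform_io(lambda read: read(1), paths)
    print(f"read {result!r} via {used}")
```

The caller sees a single successful read; the failure of the first path is absorbed by the retry, which is the effect the replicated buses and adapters are meant to achieve.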
Which way to go?
Clustering technology can provide a highly available application environment if adequate time, effort and resources are devoted to proper planning, installation, configuration and operation of the clustering solution. The issue is that it leaves too many areas open to error, and with that comes the risk of compromised availability.
Errors or shortcuts in selection and configuration of hardware, cluster software configuration and customisation, cluster testing, change management, support contracts, staff training and use of consulting services can all result in a clustering solution that falls far short of delivering the availability levels expected.
In many cases, a poorly implemented cluster will actually deliver lower availability than a single, standalone server solution.
In the “Guide to Creating and Configuring a Server Cluster under Windows Server 2003”, Microsoft states that “Server clusters do not guarantee non-stop operation, but they do provide sufficient availability for most mission-critical applications”.
In the same document, Microsoft also states: “For Windows Clustering solutions, the term ‘high-availability’ is used rather than ‘fault tolerant’. Fault-tolerant technology offers a higher level of resilience and recovery.”
Finally, it is important to note that failure recovery in a cluster occurs, in most cases, only after a server crash. The server crash that triggers a cluster failover can result in data loss or corruption, as indicated by Microsoft’s warning against performing certain types of failover testing on a production server.
For any organisation considering clustering, it makes sense to also consider the alternative of a fault-tolerant server. But if the applications in question are life-critical to the business or safety-critical to operations, then continuous availability through fault tolerance is a necessity. The good news is that it is now an affordable necessity.