Flight Data - Service Availability

BYTRON

By Bytron

Delivering flight information electronically, whether for ATC, airline or airport use, means that both operational personnel and the general public have the latest and best information.

In a 24 x 7 business the target is to achieve 100% availability but, just as an aircraft can unexpectedly “go tech”, so can a computer. From the user's perspective there is nothing more irritating than technology that doesn’t work or doesn’t provide the information required, especially when the consequences can be time-consuming and costly. Consequently, minimising the impact of service disruption on customers is always a high priority. It must begin at the system design stage and continue right through to contingency planning and disaster recovery in the event of power failure. This article outlines how we, as a company, have achieved and continue to achieve high availability.

In the late 1980s the solution was to dualise systems, with disk mirroring maintaining data integrity. On failure of the master, operations switched to the slave until the master was repaired. This usually required some manual intervention and a certain amount of downtime. By the early 1990s fault-tolerant systems were the best technical solution. Whilst still extremely expensive in today’s terms, they were designed and marketed as triple redundant, providing 99.999% (five nines) availability. Our choice was the Tandem Non-Stop UX platform. One satisfied customer obtained 99.998% availability over 10+ years (less than 2 hours of downtime), and another system providing similar reliability is still operational and about to enter its tenth year of service. It should be noted that these systems operate in airport environments that have emergency generator back-up in the event of power failure.
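The relationship between an availability percentage and the downtime it permits is simple arithmetic, and a minimal sketch makes the figures above easy to check. The function names and the use of an average 365.25-day year are assumptions made for illustration, not part of any system described in this article.

```python
# Hypothetical helpers for converting between availability and downtime.
# Assumes an average year of 365.25 days (8,766 hours).
HOURS_PER_YEAR = 365.25 * 24


def downtime_hours(availability_pct: float, years: float) -> float:
    """Total downtime (hours) implied by an availability figure over a period."""
    return (1 - availability_pct / 100) * years * HOURS_PER_YEAR


def availability_pct(downtime_h: float, years: float) -> float:
    """Availability (percent) implied by a total downtime over a period."""
    return 100 * (1 - downtime_h / (years * HOURS_PER_YEAR))


if __name__ == "__main__":
    # 99.998% over 10 years works out at roughly 1.75 hours of downtime,
    # consistent with "less than 2 hours".
    print(f"{downtime_hours(99.998, 10):.2f} hours over 10 years")
    # Five nines allows a little over 5 minutes of downtime per year.
    print(f"{downtime_hours(99.999, 1) * 60:.2f} minutes per year")
```

Run against the figures quoted above, 99.998% over ten years corresponds to roughly 1.75 hours of downtime, and the five nines target leaves a little over five minutes per year.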

By the late 1990s the relative cost of other high availability solutions was falling and clustering became the new solution. In 1999 we designed and implemented our own rack-mounted, triple redundant, cluster-based system for our Flight Data Centre to provide pre-flight briefing services. Backup power took the form of heavy-duty Uninterruptible Power Supplies (UPSs) that could be daisy-chained together if necessary. A series of power failures, and in particular a major 6-hour failure in 2001, clearly showed that unless our contingency plan was modified to take into account the potential for serious (more than 3 hours) interruption of electrical supply, our end-users could suffer. Our then current disaster/contingency planning and the UPSs ensured that our service was maintained for more than 3 hours of the 6-hour outage and kept us well within our Critical “workaround (relief)” target of 8 hours, but the plan clearly needed review. The review comprised two main parts: a) a re-appraisal of internal solutions and b) requests for information/explanation from the local electricity distributor as to the reasons for such a long failure.

For the first part, our ultimate solution was to invest in a backup generator (as in the airport environment) to supplement the Data Centre UPS installation. For the second part, obtaining a satisfactory explanation and/or action from the electricity distributor was not an easy task!

Our Data Centre is located in an area where most electricity is delivered by overhead power lines that can be vulnerable to the effects of storm activity. In response to our request for information, possible storm damage was given as the reason for the failure and we were sent an information booklet that was equally vague. As it was neither stormy nor blowing a gale on the night in question, we were understandably not happy with this response. Further investigation revealed some quite alarming statistics for the region. The following is an extract from our letter of reply.

“…… prompted to visit your Website and note that your targets for the end of Year 2000 were to reduce the average customer minutes lost per connected customer to 56 minutes. For the year 2000 and 2001 our minutes lost have been nearly five times greater. Also these statistics show that the region where we are situated does not fare well in the table having the:-

  1. highest number of minutes lost per customer
  2. highest number of customer interruptions per 100 customers
  3. highest number of HV losses due to faults
  4. highest total losses due to faults
  5. highest total of pre-arranged outages
  6. highest number of minutes to first restoration for HV interruptions
  7. highest number of restorations that took more than 3 hours.

In this area at least, you have clearly not met your targets. Could we be connected to a rogue circuit as a power outage for more than 3 hours has occurred for the second time in less than 12 months?”

Fourteen months, numerous letters and four further major failures of 3+ hours later, it was eventually conceded by the distributors that there could be a problem with the circuit. The fault was traced and then repaired within 7 days. Since that time, apart from the odd “brown-out” the supply has only been affected during electrical storms and then only for a very short period of time.

Another resolution of the review was to implement a policy of recording every failure and observed voltage drop so that any increase in the frequency of these can be reported. We still maintain that policy today.
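A recording policy of this kind can be kept in something as simple as a dated list of events. The sketch below is illustrative only: the `PowerEvent` record, the event labels, the 90-day comparison window and the sample dates are all assumptions, not a description of our actual logging system. It compares the number of events in the most recent window with the window before it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class PowerEvent:
    """One logged supply problem (hypothetical record layout)."""
    when: datetime
    kind: str        # e.g. "outage" or "voltage_drop" (illustrative labels)
    minutes: float   # duration; 0 for a momentary dip


def count_between(events, start, end):
    """Number of events with start <= when < end."""
    return sum(1 for e in events if start <= e.when < end)


def frequency_increased(events, now, window_days=90):
    """True if the latest window holds more events than the one before it."""
    recent_start = now - timedelta(days=window_days)
    earlier_start = recent_start - timedelta(days=window_days)
    recent = count_between(events, recent_start, now)
    earlier = count_between(events, earlier_start, recent_start)
    return recent > earlier


if __name__ == "__main__":
    # Made-up sample data, purely to exercise the functions.
    log = [
        PowerEvent(datetime(2024, 2, 10), "voltage_drop", 0),
        PowerEvent(datetime(2024, 5, 5), "outage", 12),
        PowerEvent(datetime(2024, 6, 20), "voltage_drop", 0),
    ]
    if frequency_increased(log, now=datetime(2024, 7, 1)):
        print("Event frequency has risen; report to the distributor.")
```

Comparing two equal windows keeps the check independent of how long the log has been running; the window length would need tuning to the actual event rate.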

Whilst clustering in itself has provided very high availability, it does not have the advantage of the remote performance monitoring that was supplied with every Tandem computer. We have developed extensive monitoring and alerting mechanisms, but third-party diagnostics can never be as good as those of the manufacturer, and for this reason we have recently installed a new-generation Stratus ftServer. Stratus' industry-standard, fault-tolerant servers eliminate the operational complexity and high costs inherent in other high-availability approaches such as clusters and bring the highest levels of uptime protection to industry-standard Microsoft® Windows® environments. Supporting this are the Stratus ActiveService™ capabilities built into every ftServer system, which connect to a global service network for detecting, troubleshooting and resolving problems fast, usually without the need for an on-site call.

BrightStrand Ltd, a strategic channel partner for Stratus in the UK, performed the installation and training and provides on-going support. BrightStrand are specialists in non-stop computing solutions for business continuity and have the knowledge and skill to advise on and implement business critical solutions. We now feel we are able to offer services with a difference: services that focus on preventing downtime instead of simply providing a remedy after the fact.

Whatever hardware platform is used, contingency planning is essential to any business, let alone one that operates 24 x 7. It should always look at the worst-case scenario, no matter how remote or ridiculous the possibility may seem.

The main aim in contingency planning is to minimise the impact on business continuity, but disaster recovery should also figure prominently in those plans. (Who would have believed the events of 9/11?) A “what if” analysis should be performed. This should cover not only the worst case but also day-to-day operations, such as loss of incoming and outgoing data communication lines and downtime of third party data delivery systems that are outside your control. To stabilise business operations you need to know:

· What your core business functions and their dependencies are.

· What mission critical systems support the core business activity/function/process.

· What the risks of failure of these mission critical systems are.

· What impact such failure has on the viability and operation of the core business function (Business Impact Analysis).
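The four questions above amount to a dependency map between business functions and the systems that support them. The following is a minimal sketch of such a “what if” check; the function and system names are invented for illustration and do not describe our actual configuration.

```python
# Hypothetical map: core business function -> systems it depends on.
DEPENDENCIES = {
    "pre-flight briefing": {"briefing server", "incoming data feed", "mains power"},
    "customer delivery": {"briefing server", "outgoing comms line", "mains power"},
    "billing": {"accounts server", "mains power"},
}


def affected_functions(failed_system, dependencies=DEPENDENCIES):
    """Business functions that stop if the named system fails."""
    return sorted(
        function
        for function, systems in dependencies.items()
        if failed_system in systems
    )


def shared_by_all(dependencies=DEPENDENCIES):
    """Systems that every function depends on: the worst single failures."""
    return set.intersection(*dependencies.values())


if __name__ == "__main__":
    for system in sorted(set.union(*DEPENDENCIES.values())):
        print(f"If {system} fails: {affected_functions(system)}")
    print("Needed by every function:", shared_by_all())
```

Even a table this small makes the Business Impact Analysis concrete: in this invented example a loss of mains power stops every function, which is the kind of finding that justifies a generator.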

Once you have performed this analysis and put your contingency plans in place, the key is to ensure that staff have a high level of awareness of the procedures and that the procedures are maintained and kept up to date. Without jeopardising current practice, contingency plans should be regularly tested for:

· Omissions

· Errors

· Levels of awareness

Testing should determine:

· Whether the plans are capable of supplying the required level of support

· Whether they can be implemented within the required time frame

· What costs are incurred when a plan is in use.

As with any contingency plan the key elements are - communicate, communicate, communicate; document, document, document.

And finally, review all contingency plans regularly - do not become complacent! Always assume that whatever has or has not been done to prevent problems – they are going to occur anyway!

Our cluster-based Flight Data Centre delivers up-to-the-minute, route-specific, integrated flight briefings via ISDN to fixed bases and to a wider audience over secure Internet. It is currently operating at 99.995% availability after 7 years of service. Operations will shortly switch over to our new ftServer and, with the help of Stratus and BrightStrand, we aim to achieve the five nines target!
