Doubt that a power surge caused BA’s IT fiasco – bad design and poor planning more likely
A massive failure of British Airways’ IT system left 300,000 passengers stranded around the world. This will be remembered as a catastrophic event for BA, and there are many questions about what happened.

BA said the failure was due to a “power surge” that was “so strong that it rendered the back-up system ineffective”. Several experts have since publicly expressed doubt about that account: they do not believe a power surge should be able to bring down a data centre, let alone a data centre and its back-up. One said that would mean either bad design of the system or some other explanation. A data centre would normally have surge protection, which exists to guard against exactly this problem, along with an uninterruptible power supply and proper earthing. The companies supplying electricity to the area where BA holds its data say there was no power surge.

Experts say much of the problem was the time taken to reboot the system. The overly complex IT system is largely outsourced to India, and many of the experts in the UK who originally helped to build and develop the network left when the jobs were moved. The extent of the BA problem may therefore be due to poor crisis management planning, an under-trained and under-staffed IT support team, and a poor understanding of the wider logistics. The reputational cost to BA could be very significant.
The Times, on 2nd June 2017, said:
The incident is thought to concern an uninterruptible power supply (UPS) which delivers a smooth flow of power from the mains, with a fall-back to a battery back-up and a diesel generator.
This week BA’s parent company, International Airlines Group (IAG), admitted that the supply to Boadicea House, a data centre, was temporarily lost. An internal investigation found that the UPS, believed to have been supplied by Hitec Power Protection, was functioning correctly at the time.
One BA source said it was rumoured that a contractor doing maintenance inadvertently switched the supply off, although this has not been confirmed.
An internal email from Bill Francis, head of group IT at IAG, appeared to confirm this version of events. The email, leaked to the Press Association, said: “This resulted in the total immediate loss of power to the facility, bypassing the backup generators and batteries… It was turned back on in an unplanned and uncontrolled fashion, which created physical damage to the system.”
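The fall-back chain The Times describes – mains supply first, then battery, then diesel generator – can be sketched as a simple priority failover. This is a hypothetical illustration of the general arrangement, not a model of BA’s actual system:

```python
# Hypothetical sketch of a UPS fall-back chain: mains power first,
# then battery, then diesel generator. Names and behaviour are
# illustrative only, not based on BA's real configuration.

FALLBACK_ORDER = ["mains", "battery", "diesel_generator"]

def select_power_source(available):
    """Return the first healthy source in priority order, or None."""
    for source in FALLBACK_ORDER:
        if available.get(source, False):
            return source
    return None  # total loss of power to the facility

# Normal operation: mains is healthy, so the chain never engages.
print(select_power_source(
    {"mains": True, "battery": True, "diesel_generator": True}))    # mains

# Mains lost: the UPS should switch to battery without interruption.
print(select_power_source(
    {"mains": False, "battery": True, "diesel_generator": True}))   # battery

# If the supply is cut in a way that bypasses the UPS entirely,
# every source reads as unavailable and power is lost outright.
print(select_power_source(
    {"mains": False, "battery": False, "diesel_generator": False})) # None
```

The point of the leaked email is the third case: a cut that bypasses the generators and batteries defeats the whole chain, however well each individual layer works.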
5 talking points around the British Airways IT failure
An IT system failure left 300,000 passengers stranded around the world in what will be remembered as a catastrophic event for British Airways. But what really happened?
By Roy Manuell (Junior Editor, International Airport Review)
31 May 2017
Why might we have reason to doubt the “power surge” claims?
According to Álex Cruz, BA’s chairman and chief executive, the catastrophic IT failure that left hundreds of thousands of passengers stranded over the UK Bank Holiday Weekend was due to a “power surge.”
It is alleged that the surge was “so strong that it rendered the back-up system ineffective”.
Several experts have subsequently publicly expressed their doubt as to the validity of this claim. “Multiple data centre designers have told the Guardian that a power surge should not be able to bring down a data centre, let alone a data centre and its back-up,” the British newspaper reported on Tuesday 30 May.
“It’s either bad design or there’s more to the story than just a power surge,” said James Wilman, chief executive of the data centre consultancy Future-tech. “You have something specifically that you build in to a data centre called surge protection, which is there to protect against exactly this incident. You also have an uninterruptible power supply, a UPS, and part of its job is to condition the power” – i.e. smooth out the peaks and troughs in current.
“Between those and a quality earthing system, you should be protected from power surges,” Wilman said.
According to another leading British newspaper, The Times, SSE and UK Power Networks, the two electricity companies that supply the area where the airline’s data centre is located, have denied that there was any power surge.
Why was the impact so large?
An unnamed corporate IT expert, speaking to the BBC, further suggested that a power failure shouldn’t even have caused a “flicker of the lights” in the data centre, thanks to the presence of the UPS – the uninterruptible power supply.
Essentially, the scale of the impact and the number of people affected are largely down to the time taken to reboot the system.
Why did BA’s reboot take so long?
Once the power was lost, the airline’s crisis management plan should have kicked in. But, as many media outlets have suggested, the overly complex IT system is largely outsourced to India, and many of the experts who originally helped to build and develop the network left when the jobs were moved.
Many believe that the slow reboot was the result of a combination of poor crisis management planning, an under-trained and under-staffed IT support team, and an incomplete understanding, in the moment, of the complex logistics of air travel in the 21st century.
Is the airline industry’s infrastructure really that outdated?
Returning to James Wilman: he suggests that airlines’ IT systems are fundamentally outdated, and that British communications infrastructure in particular is too old.
“We were leading the communications curve back 20 years ago, and the problem is that that now means that much of our infrastructure is hanging off a 25-year-old backbone. Some data centres are reaching the end of their life. And how do you refurbish that when you can’t turn it off?
“If you saw the amount of old infrastructure that this country is hanging off of, you wouldn’t sleep at night,” he said.
The cyber threat and conclusions to come?
It would be unfair to BA to over-speculate while the investigation is only just underway and information is only beginning to emerge.
Some fear that a cyber attack may have been behind the failure, while others argue that it was simply one big mistake. Either way, it has happened, and serious questions now need to be asked, as the implications of the failure could be monumental.
International Airport Review will be covering the story as more information comes through.
British Airways could face £100m compensation bill to passengers over its huge IT failure
British Airways could face a bill of at least £100 million in compensation for passengers affected by the cancellations and delays caused by its IT systems failure. The problem, perhaps caused by a loss of electric power, left most systems not working: BA flights around the world were unable to take off, passengers were unable to check in, and even the website went down.

Heathrow, BA’s largest base, was the worst affected airport in England; Gatwick was also hit. In total about 1,000 flights were affected, with disruption likely to last several days more while systems are fixed and planes get back into the right places.

Because this computer fault is entirely BA’s own doing (and not any sort of “act of God”), BA will be liable to pay full compensation to anyone delayed by more than three hours. The airline was particularly busy, as it was the start of the school half term and also a Bank Holiday weekend, with people flying for weekends away.

The GMB union said the problem had been caused in part because BA made many good IT staff redundant in 2016 to save money, outsourcing the work to India instead. Besides the huge cost of compensation (and of improving its IT resilience), BA will have suffered serious reputational damage, with many saying they would avoid ever flying with BA again.
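The order of magnitude of the £100 million figure can be checked with a rough calculation based on the EU261 compensation bands (€250 short-haul, €400 medium-haul, €600 long-haul for delays over three hours). The passenger split between bands and the exchange rate below are illustrative assumptions, not reported figures:

```python
# Back-of-envelope check on the scale of BA's potential compensation
# bill under EU Regulation 261/2004. The per-band rates are the real
# EU261 amounts; the eligible-passenger split and exchange rate are
# assumed for illustration, not reported breakdowns.

EU261_RATES_EUR = {"short": 250, "medium": 400, "long": 600}

# Assumption: of the ~300,000 affected passengers, suppose these
# counts qualify for compensation in each distance band.
assumed_eligible = {"short": 120_000, "medium": 80_000, "long": 50_000}

GBP_PER_EUR = 0.87  # illustrative exchange rate

total_eur = sum(EU261_RATES_EUR[band] * n
                for band, n in assumed_eligible.items())
total_gbp = total_eur * GBP_PER_EUR

print(f"Estimated compensation: €{total_eur:,} ≈ £{total_gbp:,.0f}")
```

Under these assumptions the delay compensation alone comes to roughly £80 million; on top of that, EU261 also obliges the airline to cover refunds, hotels and meals, which is consistent with a total bill of at least £100 million.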