A down day is a death day


Or at least it should be.

Like many other professionals in the software industry, I have been following the RBS debacle for the last few days. I was waiting for a reasoned, well-researched answer. Unfortunately, I was disappointed when it turned out the answer was “Computer says No“.

Yet again, general ignorance of computers means that a bank which by rights should be sent to the wall for such atrocious management of its systems gets away with it, and Stephen Hester can say, in effect, “There be monsters in them thar computers”. As if running a reliable banking operation on modern IT were as magical and hard to comprehend as frickin’ witchcraft or something. It’s not magical. It’s not that difficult. The more mission-critical a system is, the more experienced the people you need to design, run, and maintain it. You need pros in all of these areas to ensure the systems are reliable. You also need to ensure that when you do an upgrade you have an effective roll-back mechanism.
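To be concrete about what I mean by an effective roll-back mechanism, here is a minimal sketch in Python. The deploy, health-check and restore hooks are entirely hypothetical stand-ins for whatever change-management tooling an organisation actually has; the point is the shape of the procedure, not the detail:

```python
# Minimal sketch of an upgrade with an automatic roll-back path.
# deploy(), health_check() and restore() are hypothetical placeholders,
# not anyone's real tooling.
import time


def deploy(version: str) -> None:
    """Push the new release to the live estate (placeholder)."""
    print(f"deploying {version}")


def health_check() -> bool:
    """End-to-end check: does a sample of transactions still clear? (placeholder)"""
    return True


def restore(snapshot: str) -> None:
    """Roll back to the known-good snapshot taken before the change (placeholder)."""
    print(f"restoring {snapshot}")


def upgrade_with_rollback(version: str, snapshot: str, checks: int = 5) -> bool:
    deploy(version)
    # Re-check repeatedly: failures often show up under load, not at switch-on.
    for _ in range(checks):
        if not health_check():
            restore(snapshot)  # back out immediately; don't debug in production
            return False
        time.sleep(1)  # interval shortened for the sketch
    return True


if __name__ == "__main__":
    ok = upgrade_with_rollback("batch-scheduler-v2", "pre-upgrade-snapshot")
    print("upgrade kept" if ok else "rolled back to snapshot")
```

The important bit is that the decision to back out is made automatically against a known-good snapshot taken before the change, rather than being debated at 3am while the backlog grows.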

I’ve had my fair share of issues with upgrades. We had an issue where some students couldn’t access their grades when the system was under heavy load. That was understandable, given that the system only sees that kind of load on two days, twice a year. Even then we managed to tweak the systems within a few hours to ensure 30,000 students and staff could get to their grades. We had a back-out plan, we executed it, and all was back to normal within a few minutes. We still kept a close eye on the systems anyway, and left late and came in early to make sure all was still well the next day.

For some reason this seems not to have been done at RBS. They performed an upgrade on a live system and it died a death. They then rolled back the upgrade, and a week later the bank still hasn’t rebalanced its books. This raises several important questions. I’m not particularly concerned about how quickly the system recovered – although serious questions need to be answered there too – my concern is this: given how serious and how widespread the issues are, how on Earth were they not picked up in testing? Or was that testing ever carried out? Did they do the upgrade and everyone said ‘well, it looks like it worked. New version number and everything!’, or did they upgrade, run a few hundred thousand ‘normal day’ transactions through the test systems, and then check it had all worked? Just the upgrade, or a full ‘day in the life’ test? Was a test plan even written up? A risk assessment? Any thought about the consequences whatsoever?
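For the avoidance of doubt, by a ‘day in the life’ test I mean something like the following sketch: replay a recorded day’s worth of transactions against a staging copy of the upgraded system and confirm the books still balance afterwards. The account names, transaction format and helper functions here are made up purely for illustration – a real run would push millions of records through the actual batch engine:

```python
# Sketch of a "day in the life" regression test: replay a recorded day's
# transactions on a staging copy and check that the books still balance.
# apply_transaction() and the sample accounts are hypothetical.
from decimal import Decimal


def apply_transaction(accounts: dict, txn: dict) -> None:
    """Post one debit/credit pair (placeholder for the real batch engine)."""
    accounts[txn["from"]] -= txn["amount"]
    accounts[txn["to"]] += txn["amount"]


def replay_day(accounts: dict, transactions: list) -> None:
    for txn in transactions:
        apply_transaction(accounts, txn)


def reconciled(before: dict, after: dict) -> bool:
    """Money only moves between accounts; the total across the book must not change."""
    return sum(before.values()) == sum(after.values())


if __name__ == "__main__":
    accounts = {"alice": Decimal("100.00"), "bob": Decimal("50.00"), "carol": Decimal("0.00")}
    before = dict(accounts)
    recorded_day = [
        {"from": "alice", "to": "bob", "amount": Decimal("25.00")},
        {"from": "bob", "to": "carol", "amount": Decimal("10.00")},
    ]
    replay_day(accounts, recorded_day)
    assert reconciled(before, accounts), "books do not balance - do not ship the upgrade"
    print("day-in-the-life replay reconciled:", accounts)
```

If that kind of replay and reconciliation had been run against the upgraded batch scheduler before it went anywhere near production, the backlog problem would have shown up on a test rig, not in customers’ accounts.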

These are very serious issues. If only junior staff were involved – located halfway around the world, with experience of only the upgraded system and not of its potential effects on adjacent systems – then there will be hell to pay. The question is: who will get the blame? If a Royal Navy ship sank and the crew died because the life vests were all knackered, who do you think would get the blame? The government, or the lieutenant in charge of them? Of course it would be the government, and rightly so, for not ensuring the systems and life-saving equipment were in place.

Who will get the blame for the upgrade going south? By rights it should be Stephen Hester and his CIO who get sacked. Unfortunately, because the general public are so woefully ignorant of anything as soon as a computer is involved, he will get away with his Teflon-shouldered remark instead of being thrown out on his ass. He said: “There was a software change which didn’t go right and although that itself was put right quickly, there then was a big backlog of things that had to be reprocessed in sequence, which is why on Thursday and Friday customers experienced difficulty which we are well on the way to fixing.”

Way too little, way, way too late. And yet again our industry suffers because of general IT ignorance.

And so the cover-up and the Teflon begin.
