Key points
- SLA protections are largely cosmetic – when a cloud platform fails at scale, contractual remedies fall far short of actual business losses
- Regulators are already acting: APRA’s material service provider register requirement signals that concentrated cloud dependency is now a systemic governance concern
- Most organisations have disaster recovery plans for risks that no longer exist, and none for the ones that do
Most businesses have a continuity plan for a fire in a server room they no longer have. Almost none have a plan for the day their cloud provider’s software platform malfunctions.
Over more than two decades, I’ve watched the technology underpinning organisations grow exponentially as businesses moved from on-premises infrastructure to offsite hosting, and then to outsourced cloud-based set-ups.
The one thing the new paradigm could not completely outsource was risk. Yet it became acceptable to shift digestible portions of that risk elsewhere, largely to the provider via Service Level Agreements (SLAs).
As a technology strategist, I’m constantly asked which platform to buy. Which is best for the organisation in terms of features and functionality? Which is the most reliable? Which offers the best value?
But the one question I’m rarely asked may be even more important than brand selection – what happens if that platform disappears?
From fire drills to nothing
In the days of physical servers, continuity planning was tangible – offsite data centres were physical infrastructure you could walk through, point at, and examine. Boards could ask hard questions because they could see the thing they were asking about.
Migration to the cloud changed that. Data and platforms became abstract, hosted on sub-contracted cloud environments and accessed through SaaS layers that were removed (at least physically) from the organisations relying on them. For most executives and Boards, the exposure became invisible, and invisible risks are the hardest to govern.
But risks there were, and on 19 July 2024 they became apparent to the world. CrowdStrike, one of the world’s most trusted cybersecurity providers, pushed a routine configuration update to its Falcon Sensor software, as it typically did multiple times a day.
The update caused 8.5 million Windows devices across the globe to crash. CrowdStrike identified the problem and reverted the update 78 minutes later, but every machine that had downloaded the update was already broken, and each one required manual, hands-on remediation.
In Australia, Qantas, Virgin Australia and Jetstar were forced to ground flights. NAB, Westpac and ANZ reported widespread disruptions to their banking systems. The ABC and Nine Network were unable to access their broadcast systems, while our largest supermarkets, Woolworths and Coles, saw point-of-sale terminals fail.
Fixing the affected devices took days because there was no automated path back.
Question the protections in place
To manage outage risk, most major platforms offer SLAs that outline expected uptime and remedies if things go wrong. But the protections these agreements provide are often more cosmetic than commercial.
SLAs typically provide service credits or partial refunds of subscription fees when uptime targets aren’t met. Delta Air Lines in the United States estimated that the CrowdStrike outage cost it US$500 million – but CrowdStrike’s standard contract caps liability at a fraction of that figure.
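The arithmetic behind that gap can be sketched in a few lines. This is an illustration only: the subscription fee, the credit tiers, the outage length and the hourly cost of lost trading are all assumed figures, not terms from any real contract.

```python
# Hypothetical figures for illustration only; none come from a real SLA.
MINUTES_PER_MONTH = 30 * 24 * 60  # minutes in a 30-day billing month


def uptime_percent(outage_minutes: float) -> float:
    """Monthly uptime as a percentage, given total minutes of outage."""
    return 100.0 * (1 - outage_minutes / MINUTES_PER_MONTH)


def service_credit(monthly_fee: float, uptime: float) -> float:
    """Assumed tiered schedule: the share of the monthly fee credited back."""
    if uptime >= 99.9:
        return 0.0
    if uptime >= 99.0:
        return 0.10 * monthly_fee
    if uptime >= 95.0:
        return 0.25 * monthly_fee
    return monthly_fee  # the credit never exceeds the fee paid


monthly_fee = 20_000.0        # assumed subscription fee
outage_minutes = 3 * 24 * 60  # assumed three-day outage
loss_per_hour = 50_000.0      # assumed cost of lost trading per hour

uptime = uptime_percent(outage_minutes)
credit = service_credit(monthly_fee, uptime)
loss = loss_per_hour * outage_minutes / 60

print(f"Uptime for the month: {uptime:.1f}%")
print(f"Service credit:       ${credit:,.0f}")
print(f"Estimated loss:       ${loss:,.0f}")
print(f"Credit as % of loss:  {100 * credit / loss:.2f}%")
```

Even under the most generous tier in this sketch, the credit is capped at the fee paid, while the loss scales with every hour the business cannot trade.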
Vendors also rely on complex chains of subcontractors, cloud hosts, integrations and offshore processors. Each link in that chain creates another opportunity to narrow responsibility, limit liability, or point to someone else when things go wrong.
At this scale, one failure is a failure for thousands of organisations at the same moment, with no individual path to recovery – only a waiting game.
APRA, Australia’s prudential regulator, is alert to this dependency. It has required authorised deposit-taking institutions, insurers and superannuation trustees to submit material service provider registers by 1 October 2025, a clear signal that regulators are treating concentrated operational risk as a system-level governance issue.
What boards actually need to do
Risk assessment of this kind rarely happens at the right level. It gets overlooked in trusted vendor relationships, or delegated to procurement teams, and that’s simply the wrong place for it.
Three questions should be asked at board level before a procurement decision is made, and well before the platform goes live:
- In terms of data, what happens if we lose access for hours, days or weeks?
- What if the data, whether our own or that of our customers, suppliers or even employees, is wiped or irretrievably lost?
- And then, what is our response if the data is stolen or published?
Boards that treat these occurrences as only theoretical worst-case scenarios often discover they are not.
Every organisation needs someone with a mandate to ask uncomfortable questions about vendor resilience. Call it healthy paranoia – a structured willingness to challenge the assumptions baked into procurement decisions.
What’s the actual uptime history of this platform? When did we last audit the sub-contractor chain? What are we genuinely indemnified for?
That person may be a Chief Financial Officer, a Chief Operating Officer, or a trusted adviser – but whoever it is, they need to bring enough authority to slow a decision when the answers don’t hold up.
A four-step plan to close the gap
Modern organisations may enjoy the cost and productivity benefits of outsourced platforms, but those benefits cannot come at the expense of business operations.
Four steps can make a big difference:
- Stocktake your tech stack – list every critical cloud and SaaS system, ranked by business criticality, not contract value. Analyse the flow-on effects of a loss of service and, where you cannot afford to lose access, say so explicitly.
- Identify realistic redundancy – exports of key CRM, payroll and private data, alternate communication channels, and documented manual workarounds. These should be tested, practical backups and redundancies that you control.
- Wargame scenarios – what happens if you lose each major tool for a day? A week? Who decides, who communicates, and who manages the vendor relationship when it turns adversarial?
- Shift procurement culture – leave behind set-and-forget for ongoing due diligence: regular checks for vendor breaches, outage history, and material changes to sub-contractor arrangements.
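As a starting point for the stocktake, the register can be as simple as a short script or spreadsheet. The sketch below is one possible shape, not a prescribed format: the platform names, contract values, criticality scale and outage tolerances are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    name: str
    annual_contract_value: float       # recorded, but deliberately not used for ranking
    criticality: int                   # assumed scale: 1 = business stops, 5 = minor inconvenience
    max_tolerable_outage_hours: float  # how long you can go without it
    tested_workaround: bool            # is there a fallback you control and have tested?


# Hypothetical register entries for illustration only.
register = [
    Platform("Payroll SaaS", 40_000, 2, 72, True),
    Platform("Point-of-sale platform", 15_000, 1, 2, False),
    Platform("CRM", 120_000, 3, 48, True),
    Platform("Endpoint security agent", 60_000, 1, 4, False),
]


def rank_by_criticality(platforms):
    """Most critical first; ties broken by the shortest tolerable outage."""
    return sorted(platforms, key=lambda p: (p.criticality, p.max_tolerable_outage_hours))


def redundancy_gaps(platforms, max_criticality=2):
    """Critical platforms that have no tested workaround."""
    return [p for p in platforms
            if p.criticality <= max_criticality and not p.tested_workaround]


ranked = rank_by_criticality(register)
gaps = redundancy_gaps(register)

for p in ranked:
    flag = "GAP" if p in gaps else "ok"
    print(f"{p.criticality}  {p.name:<26} tolerates {p.max_tolerable_outage_hours:>4.0f}h  {flag}")
```

In this made-up register the largest contract, the CRM, ranks last, while the two cheapest-to-replace tools sit at the top with no tested fallback. That is the kind of mismatch a stocktake is meant to surface.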
When the Canvas Learning Management System was hit by a cyber attack earlier this year, more than 8,000 schools, universities and TAFEs were affected: teachers lost access to lesson plans, and students couldn’t submit assignments or revise for exams. Some schools lost the ability to communicate with parents. The outage lasted for two weeks, but some institutions were able to recover much faster than others. The main difference was their level of planning and preparedness, which allowed them to stand up a temporary solution – or even revert to a manual process – and continue with their “core business” of teaching and learning.
If one of your major cloud providers suffered an outage, how would you get your business trading again?