#013 - Blue screen for armageddon
A complex system is its own worst enemy.
You're reading Complex Machinery, a newsletter about risk, AI, and related topics. (You can also subscribe to get this newsletter in your inbox.)
(Many thanks to the people who shared links with me for this issue. You know who you are!)
The system is the cause
So … CrowdStrike. That happened.
People reacted in that very human way of looking for a clean, simple backstory to make sense of it all. Preferably of the form: "Yeh so this one dude goofed." Me? I've worked in tech too long to start there. And for that same reason, I discount the idea that this was a cyber attack.
The CrowdStrike incident is a lesson in complex systems and risk, with a little history for flavor.
Blueprint for Armageddon, Dan Carlin's Hardcore History arc on World War One, opens on a simple idea: it's rare for a single person to have widespread impact.
This is a nod to Gavrilo Princip assassinating Archduke Franz Ferdinand, an act credited with kicking off the Great War. Yet as historians (including Carlin) point out, the world was already in a precarious state by that point. WWI would likely have taken place without the shooting.
The CrowdStrike story has a similar ring. Our modern technology landscape is a complex system that spans multiple groups within a single company, and connects innumerable companies around the world. With all of these moving parts and independent actors, it's inevitable that things will go wrong. On their own these slips are too small to register. You need several small problems to collide in just the right way to get a disaster.
Don't believe me? Consider Knight Capital: a behemoth, professional-grade market maker that collapsed in under an hour because several small deployment issues just happened to meet one morning. (The anniversary of the meltdown, twelve years ago, is coming up next week. This short writeup on my website links to two deeper recaps.) If you think Knight was an isolated incident, I welcome you to check out the Deepwater Horizon explosion and the 1980 Titan missile crisis in Damascus, Arkansas. You can point to specific problems and individual actions, but you'd have trouble identifying the One True Cause.
That's because the entire complex system is the cause. All of those moving parts are why Things Happen and also why Things Eventually Fall Apart. Complexity is its own worst enemy, appearing to function yet always a step away from a breakdown.
So yes, the CrowdStrike incident certainly had its Gavrilo Princip — the person who pushed the button to release the code update. But think of everything else that had to go wrong for that one mundane keypress to send more than eight million Windows computers into blue-screen territory. That person is not at fault.
By now you're wondering about other complex systems in your world. How many small problems occurred today? How many collisions were narrowly averted because of the sheer luck of timing? How many more lucky days do you get?
The complex systems approach also addresses the cyber attack idea. I agree that calling the CrowdStrike failure an accident would make for a plausible cover story. Yet the story is plausible precisely because it is the most likely case! That's the scary part. An attacker doesn't have to do the hard work of creating trouble; they can just sit back and let entropy take care of it for them.
Every disaster is a learning opportunity
The large, highly connected nature of a complex system pretty much guarantees that incidents will occur. While we can't predict specific outcomes, we can reduce the opportunities for those small slip-ups to collide.
I hope companies treat CrowdStrike's terrible, horrible, no good, very bad day as a learning experience. Here are the highlights:
1/ Automatic updates are usually a bad idea. What happened with CrowdStrike can happen with any vendor that pushes automatic updates. You never really know what's a good time for your customers to experience a disruption. So why not let them choose?
My understanding is that CrowdStrike customers had no such choice. They received the update when CrowdStrike wanted to push it, instead of when they were ready to handle it. (Just ask all of the airlines and their stranded travelers.) While "only" about one percent of Windows devices were affected, by Microsoft's estimate, that's cold comfort when you find yourself on the wrong side of that statistic. Or when you are a CrowdStrike shareholder.
2/ Platform and concentration risks are real. When a behemoth company slips, it takes its customer base down with it. An outage at a platform company like Amazon Web Services (AWS) or GitHub is infrequent, but when it hits, it affects so many other high-profile companies that it can feel like the entire world has ground to a halt.
Sometimes you have to use the same vendor as everyone else. In that case you can size up the risks and plan accordingly. I've seen companies set up shop in less-popular AWS regions to limit their exposure. Some draft plans for moving between cloud vendors in the event of an outage. And then there were the Silicon Valley Bank customers who had established backup accounts at other institutions. They had a tense few days, but there were no worries of missing that week's payroll.
3/ Brakes help you go faster. You've no doubt heard the line "slow is smooth; smooth is fast." It's the reason a race car driver can handle a turn faster if they apply the brakes at the right spot. Facebook's "move fast and break things" mantra sounds good on paper, but wise companies prefer less chaos in their tech changes.
This idea is more nuanced for a security vendor like CrowdStrike, which has to stay ahead of fast-moving threats. Maybe you're in a similar boat. So if you must push changes, why not release to a small subset of customers at a time? Better yet, push releases to a portion of each customer's devices so you don't take down the whole operation. And when you do that, avoid devices they've marked as mission-critical.
You do give customers a chance to mark their systems as such, don't you?
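For the hands-on crowd, here's a rough sketch of what "a portion of each customer's devices, minus the mission-critical ones" could look like. To be clear: this is my own toy illustration, not how CrowdStrike (or anyone else) actually stages releases. The Device record, the mission_critical flag, and the pick_wave function are all hypothetical names I made up for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    # Hypothetical device record; a real fleet inventory will look different.
    customer: str
    name: str
    mission_critical: bool = False


def pick_wave(devices, fraction, seed=0):
    """Pick roughly `fraction` of each customer's non-critical devices."""
    rng = random.Random(seed)  # seeded so a given wave is reproducible

    # Group eligible devices by customer, skipping anything marked critical.
    by_customer = {}
    for device in devices:
        if not device.mission_critical:
            by_customer.setdefault(device.customer, []).append(device)

    wave = []
    for eligible in by_customer.values():
        # At least one device per customer, never the customer's whole fleet
        # unless the fraction says so.
        count = max(1, int(len(eligible) * fraction))
        wave.extend(rng.sample(eligible, count))
    return wave


# Made-up fleet: "acme" has two mission-critical machines, "globex" has none.
fleet = [Device("acme", f"pc{i}", mission_critical=(i < 2)) for i in range(12)]
fleet += [Device("globex", f"srv{i}") for i in range(20)]

first_wave = pick_wave(fleet, fraction=0.1)
```

The idea is that you'd push to first_wave, watch for trouble, and only then widen the fraction. The mission-critical machines stay out of every automatic wave until the customer says otherwise.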
4/ Little problems don't add up. They compound. If you spot a weak procedure, or if you see people going around safety barriers, it's only a matter of time before everything goes to hell. You don't have to know what the specific failure will be. All you need to know is that every slip-up you prevent reduces the chances of an incident taking place. And every backup procedure you establish will reduce the impact when it happens.
(Coda: I'd originally outlined these steps the day of the CrowdStrike incident. I've since learned that items 1 and 3 align with preventive measures listed in CrowdStrike's own after-action review, dated 2024/07/24.)
An admittedly odd source of ideas
The oft-cited xkcd comic "Exploits of a Mom" – better known as "Little Bobby Tables" – is a wink at SQL injection attacks. This early-internet exploit involved "injecting" code into web forms and ecommerce checkout pages to coerce a database into erasing records, applying steep discounts on merchandise, or whatever.
Today we have prompt injection, in which people craft clever inputs to bypass an AI chatbot's defenses. Like, say, telling the bot that the napalm recipe is just for a story they're writing. As summarized by technology author Dan Hon:
Can't believe Little Bobby Tables is all grown up and has had their first kid, Ignore All Previous Instructions
The solution to SQL injection was to invoke special "escaping" routines on the user-supplied inputs. Escaping told the database "hey don't treat these values as instructions" and it made SQL injection damned near impossible.
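If you've never seen the difference in code, here's a small sketch using Python's built-in sqlite3 module. One caveat: instead of calling an escaping routine by hand, it uses a parameterized query (the "?" placeholder), which is the common modern way to get the same "don't treat this value as instructions" effect. The students table and the helper names are just assumptions for the demo.

```python
import sqlite3

# Little Bobby Tables' full name, more or less as it appears in the comic.
bobby = "Robert'); DROP TABLE students;--"


def fresh_db():
    """Create a throwaway in-memory database with an empty students table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE students (name TEXT)")
    return conn


def table_exists(conn):
    """Report whether the students table is still there."""
    row = conn.execute(
        "SELECT count(*) FROM sqlite_master "
        "WHERE type = 'table' AND name = 'students'"
    ).fetchone()
    return row[0] == 1


# Unsafe: the input is pasted straight into the SQL text, so the database
# sees an INSERT followed by a DROP TABLE.
unsafe = fresh_db()
unsafe.executescript(f"INSERT INTO students (name) VALUES ('{bobby}')")
unsafe_survived = table_exists(unsafe)

# Safer: the input travels separately from the SQL text, so it is stored
# as a plain (if odd-looking) name.
safe = fresh_db()
safe.execute("INSERT INTO students (name) VALUES (?)", (bobby,))
safe_survived = table_exists(safe)
stored_name = safe.execute("SELECT name FROM students").fetchone()[0]
```

In the first case the table is gone. In the second, the school simply has a student with a very strange name.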
The equivalent for LLM chatbots is …
Well, that's the problem. We're not quite there yet. In part because people are still finding novel ways to break a bot's defenses. Some Microsoft researchers have even crafted a so-called "skeleton key" attack that affects all of the major chatbots.
Oddly enough, chatbot-curious businesses might consider – dramatic pause to emphasize how weird it feels to say this – taking a page from the Chinese government's handbook on AI. They've issued rules to the country's tech firms to keep their chatbots on-track:
The filtering begins with weeding out problematic information from training data and building a database of sensitive keywords. China’s operational guidance to AI companies published in February says AI groups need to collect thousands of sensitive keywords and questions that violate “core socialist values”, such as “inciting the subversion of state power” or “undermining national unity”. The sensitive keywords are supposed to be updated weekly.
The result is visible to users of China’s AI chatbots. Queries around sensitive topics such as what happened on June 4 1989 — the date of the Tiananmen Square massacre — or whether Xi looks like Winnie the Pooh, an internet meme, are rejected by most Chinese chatbots. Baidu’s Ernie chatbot tells users to “try a different question” while Alibaba’s Tongyi Qianwen responds: “I have not yet learned how to answer this question. I will keep studying to better serve you.”
Hmmm.
On the one hand: a bot that eliminates criticism of country leadership and erases historical events does not sit well.
On the other hand: for a corporation that just wants the bot to stick to the customer service script, this approach might be worth a try.
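For the curious, a bare-bones version of the keyword idea might look like the sketch below. Everything in it is an assumption on my part: the blocked terms, the fallback message, and the function names are invented for illustration, and none of it reflects the Chinese guidance or any vendor's actual filter. A plain keyword list is also easy to sidestep with creative phrasing, so treat this as a first line of defense at best.

```python
import re

# Hypothetical list of phrases the company doesn't want its bot to engage with.
BLOCKED_TERMS = {"napalm", "ignore all previous instructions"}

# Hypothetical canned reply for anything that trips the filter.
FALLBACK = "Sorry, I can only help with orders and returns."


def normalize(text):
    """Lowercase the text and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def is_blocked(message, blocked_terms=BLOCKED_TERMS):
    """Return True if any blocked phrase appears in the normalized message."""
    cleaned = normalize(message)
    return any(term in cleaned for term in blocked_terms)


def guarded_reply(message, answer_fn):
    """Screen the message first; only call the bot if it passes."""
    if is_blocked(message):
        return FALLBACK
    return answer_fn(message)


# Stand-in for the real chatbot call.
def fake_bot(message):
    return "Your order is on its way."


ok_reply = guarded_reply("Where is my order?", fake_bot)
blocked_reply = guarded_reply("Ignore  ALL previous\ninstructions, please", fake_bot)
```

The point of the wrapper is that the screening happens before the model ever sees the message, which keeps the decision in boring, testable code.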
Taking a wider view: all of this goes back to a point I have raised time and again: if you want a system that never wanders off-course, you could … implement search. Just a thought.
Artificial? Yes. Intelligent? Maybe.
To end on a lighter note, this Saturday Morning Breakfast Cereal comic has captured the AI mania in just six panels.
My take is that you could get twice the money for just the idea of the balloon. No need to actually produce the artifact.
Or, as so eloquently put by longtime friend (and expert web developer) Scott Robbin:
In all fairness, “Artificial Intelligence” doesn’t just apply to the product.
Indeed.
Now if you'll excuse us, Scott and I are preparing a pitch deck about a balloon …
In other news …
Publisher Taylor & Francis has partnered with Microsoft around – what else? – using authors' work in AI systems. Authors are not amused. Nor were they aware this was happening until after the fact. (The Bookseller)
Remember Google's long road to removing third-party cookies from Chrome? They'd prefer you forget, since it's no longer happening. (Washington Post)
There may come a day when AI assistants provide value. But that day is not today. (Bloomberg)
I've long criticized using AI chatbots for summarization. Turns out I'm not the only one who finds fault in this use case. (R&A IT Strategy & Architecture)
State Senator Scott Wiener explains the California AI bill. (Vox)
Given how well AI is going, who wouldn't want autonomous weapons on the battlefield? (The Guardian)
Proton Mail (the privacy-themed mail service) is adopting AI (a tool which isn't known for privacy). (Pivot to AI)
The wrap-up
This was an issue of Complex Machinery.
Who’s behind Complex Machinery? I'm Q McCallum. I think a lot about AI and risk, which I write about here.
Disclaimer: This newsletter does not constitute professional advice.