At a Toyota assembly plant, above every workstation on the line, there is a cord. Any worker, at any station, at any moment, can reach up and pull it. Pulling the cord stops the line. Not their station. The whole line. Everyone downstream of them, and eventually everyone upstream too, comes to a halt until the problem that caused the pull has been investigated and resolved.
This is called the andon cord, and the thing I find remarkable about it is not that it exists. It is who is allowed to pull it. The most junior worker on the line, on their first day, has the same authority to stop production as the plant manager. The cord draws no distinction based on experience, seniority, or departmental ownership. If you can see a problem, you can stop the line. The reasoning behind this, which Taiichi Ohno documented in the Toyota Production System, is that the person closest to the defect is the person best positioned to catch it, and any system that requires them to escalate first will lose the window in which stopping was still cheap.
I have been thinking about the andon cord a lot recently, because almost every software product I have worked on has quietly inverted it. The cord exists, sometimes, in the form of a feature flag or a kill switch or a rollback procedure. But the cord is behind a locked door, and the key is held by the engineering team. The person closest to the defect, who is usually a user watching the defect happen to them in real time, has no cord to pull. They have a support ticket to file. The distance between seeing the problem and stopping the problem runs through someone else’s inbox.
I want to argue that this is the wrong default, and that the reasons it is the default are worth examining, because almost none of them survive contact with the actual cost of an incident.
Why the Cord Never Gets Built
If you asked a founder directly, “should your users be able to stop a runaway process that is harming them”, almost nobody would say no. The answer is obviously yes. And yet in practice the cord does not get built, and the reason it does not get built is not that anyone decided against it. The reason is that it never wins a prioritisation meeting.
Founders prioritise features that deliver business value, and they are right to. Every sprint is a choice between things that move the company forward and things that might matter one day if something goes wrong. The thing that moves the company forward has a visible return. You can point to it in a pitch deck, demo it to a customer, or watch the activation metric tick up the week after it ships. The thing that might matter one day has no return at all, unless the day actually comes. Good founders, the kind who build companies that survive the first eighteen months, are relentless about cutting work that does not move the business. Defensive tooling looks exactly like the work they have trained themselves to cut.
This is not negligence. It is the same instinct that keeps a company alive when runway is short and focus is the only real competitive advantage. Anyone who has sat in a prioritisation meeting with five founders and thirty candidate features knows that “we should build a kill switch for a bug that has not happened yet” is a sentence that dies in the room. It dies for reasons that are good in isolation and catastrophic in aggregate. The feature has no user asking for it. It has no revenue attached. It does not appear on any customer’s wishlist. It costs engineering time that could have been spent on something with a measurable return. So it slips, and it slips again, and the cord never gets built, and then one day the cord is the thing that would have saved you and it is not there.
The consequence of this is that the architecture drifts into a shape where production belongs to engineering by default, not because anyone chose that arrangement but because nobody ever chose anything else. Users are outside the loop because giving them a way in never cleared the bar. Frontline workers, meaning the customer support team, the account managers, the community moderators, are slightly closer to the loop but still fundamentally observers. They file tickets. They do not pull cords. The cord does not exist for them to pull.
The cost of this arrangement only becomes visible when something goes wrong at a time when the engineers are not available, which, if you think about it for even a moment, is most of the time. Engineers sleep. They take holidays. They are in meetings. They live in one timezone and the incident is unfolding in another. The customer whose inbox is filling up with sixty identical emails is not going to wait politely until business hours in London. They are going to watch the damage happen, try to make it stop, fail, and form a permanent opinion about your product in the twenty minutes it takes someone on call to wake up and respond.
In those twenty minutes, the customer had a cord. It was just not connected to anything.
The Objections Are Weaker Than They Look
There are three objections that come up whenever I suggest giving users, frontline workers, or founders the ability to stop a production process directly. All three sound reasonable. None of them hold up well when you compare them to the cost of the alternative.
The first objection is that non-engineers will misuse the cord. They will pull it when they should not, they will pull it in a panic, they will pull it to avoid doing the harder work of diagnosing the problem. This is an empirical claim, and the experience reported from Toyota, and from manufacturing lines that have adopted andon since, is that misuse is rare and the rate of genuine catches is high. Workers do not pull the cord frivolously, because pulling the cord is a visible, public act and nobody wants to be the person who stopped the line over nothing. The same social dynamics apply in software. A support agent who kills a rogue notification batch is not going to do it casually. They are going to do it because they can see the damage happening and the alternative is watching it continue.
The second objection is that a non-engineer will not know what to do after pulling the cord. They will stop the process but they will not be able to fix it. This is true and also beside the point. Pulling the cord is not the same as fixing the problem. It is buying time so that the right person can fix the problem without the damage compounding in the meantime. The andon cord does not repair the defective part. It stops the line so the defective part can be dealt with before twenty more like it roll off the end. In software, stopping the bleeding and diagnosing the wound are two different jobs, and they can be held by two different people without any loss of correctness.
The third objection is the most honest one, and it is the one that actually keeps the cord from being built. The objection is that defensive tooling does not clear the prioritisation bar, because the return on it is invisible until the day it is not. This is a real constraint, not an excuse, and I want to take it seriously before I suggest a way around it.
The way around it, I think, is to stop framing defensive tooling as insurance. Insurance is a category of spending that everyone resents, because the return is zero in every period except the one where it saves you, and you never know in advance which period that will be. Founders are trained to cut insurance-shaped spending, and they should be. The reframe that changes the calculation is to stop calling this insurance and start calling it leverage. A kill switch that a support agent can use without paging an engineer is not a defensive feature. It is a productivity multiplier for the support team, every day, not just on the day of an incident. A user-facing pause button is not an incident response tool. It is a trust feature that a sales team can point to when an enterprise customer asks what happens if something goes wrong. An audit queue is not a safety net. It is a workflow that turns an operations person into someone who can actually own a business outcome, instead of someone who watches dashboards and hopes.
Every piece of defensive tooling I have argued for in this post has a second life as something with visible business value, and the founders who build it early are the ones who have noticed the second life. The cord is not just a cord. It is a feature that changes who can do useful work, who can make promises to customers, and who can move without waiting. That is a roadmap argument, not a safety argument, and it is the argument that actually wins the prioritisation meeting.
What Distributing the Cord Actually Looks Like
In practice this is not a radical architectural change. It is a handful of deliberate product decisions that get made early or not at all.
The user needs a direct way to stop actions that are being taken on their behalf. If the product is sending them emails, there is a “stop all emails from this account right now” button that works in one click and takes effect in seconds. If the product is charging their card, there is a way to freeze charges without closing the account. If the product is posting on their behalf, there is a pause switch that does not require talking to anyone. These controls should be findable when the user is panicking, which means they should not be buried three levels deep in settings.
The frontline team needs controls that let them act on behalf of a user, or on behalf of a cohort of users, without waiting for an engineer. A support agent should be able to kill a specific scheduled job, freeze a specific webhook, or roll back a specific batch operation from an internal tool that exists for exactly this purpose. Building that tool is product work. It is not a distraction from product work.
The founder, or whoever holds ultimate responsibility for the company surviving the next twenty-four hours, needs a master switch. Not for everything, but for the categories of action that can do the most damage. The ability to turn off outbound email entirely. The ability to pause billing globally. The ability to disable a specific integration that has started misbehaving. This is the switch you hope never to use and build anyway, because the moment you need it, you will not have time to build it.
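To make the three levels concrete, here is a minimal sketch of how they might sit in front of a single outbound action. Everything in it is an assumption for illustration: the names PauseRegistry, blocked_by and send_email are hypothetical, the state lives in memory, and a real product would keep it somewhere durable and check it in whatever layer actually sends the email or charges the card.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PauseRegistry:
    """Hypothetical in-memory record of which outbound actions are paused."""

    # Categories paused for everyone, e.g. {"email"}: the master switch.
    global_paused: set[str] = field(default_factory=set)
    # (category, job_id) pairs paused by the frontline team.
    jobs_paused: set[tuple[str, str]] = field(default_factory=set)
    # (category, user_id) pairs paused by users themselves.
    users_paused: set[tuple[str, str]] = field(default_factory=set)

    def pause_global(self, category: str) -> None:
        self.global_paused.add(category)

    def pause_job(self, category: str, job_id: str) -> None:
        self.jobs_paused.add((category, job_id))

    def pause_user(self, category: str, user_id: str) -> None:
        self.users_paused.add((category, user_id))

    def blocked_by(
        self, category: str, user_id: str, job_id: str | None = None
    ) -> str | None:
        """Return which cord blocks this action, or None if it may proceed."""
        if category in self.global_paused:
            return "global"
        if job_id is not None and (category, job_id) in self.jobs_paused:
            return "job"
        if (category, user_id) in self.users_paused:
            return "user"
        return None


def send_email(
    registry: PauseRegistry, user_id: str, job_id: str, body: str
) -> bool:
    """Check every cord before sending; return True only if the send proceeds."""
    if registry.blocked_by("email", user_id, job_id) is not None:
        return False
    # Real delivery would happen here; omitted in this sketch.
    return True
```

The design choice worth noticing is that all three cords are checked in one place, immediately before the action, so a pull at any level takes effect on the very next send rather than waiting for a deploy.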
None of this removes engineering ownership of the underlying systems. Engineers still write the code, design the architecture, and fix the bugs. What changes is that the act of stopping damage is no longer gated on an engineer being available, awake, and paying attention. The cord is distributed. The fix is still centralised.
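One way to picture "distributed cord, centralised fix" is as an asymmetry in permissions: many roles can pull, few can resume. The sketch below is illustrative only. The role names, the Cord class, and the decision to treat resuming as part of the fix are all my assumptions, and a cord pulled by a user on their own account would probably let that user resume it themselves.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical role sets: pulling is broad, resuming is narrow.
PULL_ROLES = {"support", "founder", "engineer"}
RESUME_ROLES = {"engineer"}


@dataclass
class Cord:
    """A shared stop switch for one process, with a record of who touched it."""

    name: str
    paused: bool = False
    # Each entry: (UTC timestamp, action, actor, role).
    log: list[tuple[str, str, str, str]] = field(default_factory=list)

    def _record(self, action: str, actor: str, role: str) -> None:
        stamp = datetime.now(timezone.utc).isoformat()
        self.log.append((stamp, action, actor, role))

    def pull(self, actor: str, role: str) -> None:
        """Stop the process. Allowed for any role in PULL_ROLES."""
        if role not in PULL_ROLES:
            raise PermissionError(f"{role} may not pull {self.name}")
        self.paused = True
        self._record("pull", actor, role)

    def resume(self, actor: str, role: str) -> None:
        """Restart the process. Reserved for roles in RESUME_ROLES."""
        if role not in RESUME_ROLES:
            raise PermissionError(f"{role} may not resume {self.name}")
        self.paused = False
        self._record("resume", actor, role)
```

In use, a support agent calling `Cord("outbound-email").pull("dana", "support")` succeeds immediately, while the same agent calling `resume` gets a PermissionError. The log is there so that pulling stays a visible, attributable act, which is the same social check that keeps the cord from being pulled frivolously on a factory floor.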
The Part That Is Really About Dignity
Underneath all of the practical arguments, there is a thing I believe more strongly than any of them, and I want to name it directly because I think it usually gets left implicit.
Everyone involved with a product deserves the ability to defend themselves. The user whose reputation is being damaged by a runaway notification deserves to be able to stop it without asking permission. The support agent whose day is being ruined by an incident they can see clearly deserves the tools to act on what they are seeing. The founder whose company is on the line deserves to be able to pull the emergency brake on their own business without filing a ticket against their own engineering team. Withholding those abilities, even unintentionally, is a small act of disrespect toward the people who have the most to lose when things go wrong.
And here is the part I think is actually the deepest version of the argument. The reason the cord keeps getting deprioritised is that nobody in the prioritisation meeting is speaking for the people who will need it. The user is not in the room. The support agent is not in the room. The future version of the founder, the one dealing with the incident at three in the morning, is not in the room either. The people with the most to lose from the cord not existing are the people with the least voice in the decision about whether to build it. Distributing the cord is not just a safety measure. It is an acknowledgement that the roadmap has a structural blind spot, and the product needs to be designed to compensate for the fact that the people it will hurt most are the people least able to advocate for themselves in the meeting where their protection gets cut.
The engineering team is not the only stakeholder in the product. The founder is not either. They are two of several, and in most incidents they are not even the ones taking the hit. Designing the product so that only they can defend it treats everyone else as a passenger. The andon cord at Toyota exists because Ohno understood, long before software products like these existed, that the people closest to the damage are not passengers. They are the first line of defence, and a system that does not let them act is a system that has quietly decided their judgment does not count.
I think the judgment counts. I think the cord should be within reach. And I think the test of whether a product has been designed with real respect for the people who use it and work alongside it is not whether it is beautiful, or fast, or clever. It is whether, when something goes wrong, they can do something about it without asking.