Agents Are Just Distributed Systems. Fight Me.
Bytes and Being — Take 1: Agentic Systems
We are starting to see some cracks in the “AI will code everything” scenario. It is not just the code that matters in the end — it is the entire system that needs to be considered.
Look at this tweet of an internal Amazon note — I am not sure how it made it to X, but here we are. Coincidentally, I was already writing a post on agents, distributed systems, and what to consider when designing with agents. There is no better time to talk about it than now.
It is interesting. Amazon let go of large teams from Audible and other verticals over the past months — I can only assume as part of a bet to go big on AI. And yet this internal note suggests they now want human sign-off before AI-assisted code reaches production. Something does not add up.
Their thinking is right. The tactic, I would question. Adding a senior engineer to review every AI-assisted push is not a safety system — it is a bottleneck with a hard hat on. My answer is not more humans in the loop. It is better system design. And that design already exists. We just stopped reading the papers.
This is what I have learnt designing agents for the past 3+ years. Hat tip to Vishal, with whom I built our first agent on a knowledge graph at Scoutbee. Fun times. Hard lessons.
In this post I walk through five principles from distributed systems that would have prevented the Amazon incident — and that will let you run fast without falling on your face.
Agents are distributed systems on steroids. That is a recipe for amazing things and a recipe for disaster — too much power, without proper consideration of what can go wrong.
The good news is that we have been building distributed systems for 30 to 40 years. The problems that come with making agents safe are not new. The solutions are not new either. We just stopped teaching them.
AI and agents are here to stay and grow. Combining that with everything we have learnt from running reliable distributed systems will give us the right foundation to move fast — and not break everything in the process.
Principle 1 — Blast Radius
The first thing I think about when automating anything is blast radius. Not after the design. Before it.
At Omnius, we were building AI systems for insurance claim automation. The question kept coming up: should the system automatically reimburse or settle a claim? The technology could do it. The answer we landed on was no — not because of the technology, but because of the blast radius. An AI system running 24/7, settling claims autonomously, hitting an edge case at 2AM — that is a potential hundred million dollar loss before anyone wakes up. We kept the human in the loop for final approval, reduced the manual work by 60 to 80 percent, and kept the catastrophic tail risk off the table. That thinking has stayed with me.
The Amazon incident is the same failure at a different scale. An AI coding tool with permissions to delete and recreate production environments. Nobody asked the foundational question before granting those permissions: if this agent goes wrong, what is the worst it can do? The answer, it turned out, was: thirteen hours of downtime and customers in mainland China without service.
Tools like OpenClaw — which give agents full system access, browser control, and the ability to execute scripts — are genuinely exciting. They are also the exact scenario where blast radius thinking is not optional. “Full system access” is another way of saying “unlimited blast radius.” That is fine when you are the only user, on your own machine, with full context. It is a different conversation when you are designing automated workflows for a company, running 24/7, touching production systems.
The question is not whether to use powerful tools. The question is: have you thought through what happens when that power is applied incorrectly at 2AM?
With agents, tools, and modern frameworks, the default thinking is automation first, blast radius second. That is fine for prototypes. For production workflows, it is a liability.
Here is what I would put in place instead.
First, by default, give agents read permissions only. Write access is opt-in, explicit, and justified — not assumed. Every tool an agent can call should be evaluated for its worst-case action, not its intended action (see the sketch after these three points).
Second, build an “everything that can go wrong” manual — a catalogue of failure scenarios specific to your agent and its toolset. Then go further: apply chaos engineering principles in a sandbox environment. Let the agent do its worst in a controlled space. Run it with misconfigured tools, bad inputs, adversarial prompts. Learn the unknown unknowns before production does.
Third, when a new workflow is introduced, it does not go straight to production. It runs in the sandbox first. If it triggers anything in the failure manual, a human reviews it before it touches the real environment. Once it has a clean record, it earns its place in production.
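To make the first point concrete: here is what read-only by default can look like in code. A minimal sketch in Python; the tool names and the shape of the policy are my own illustration, not any particular framework’s API.

```python
from dataclasses import dataclass, field

@dataclass
class ToolPolicy:
    """Default-deny: every tool is read-only unless write access was explicitly granted."""
    writable_tools: set[str] = field(default_factory=set)  # opt-in, justified per tool

    def check(self, tool_name: str, mutates_state: bool) -> None:
        if mutates_state and tool_name not in self.writable_tools:
            raise PermissionError(f"{tool_name} attempted a write without an explicit grant")

# Write access is granted tool by tool, never wholesale.
policy = ToolPolicy(writable_tools={"create_sandbox"})

policy.check("read_logs", mutates_state=False)  # fine: reads are always allowed
try:
    policy.check("delete_environment", mutates_state=True)
except PermissionError as err:
    print(err)  # blocked at the gate, before a single command runs
```

The detail that matters is where the deny lives: in the gate, not in the agent’s judgement.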
The 80/20 rule applies here. For 80 percent of cases, agents and tools will behave exactly as intended. The 20 percent is where all hell breaks loose — and that 20 percent is not evenly distributed. It concentrates in edge cases, novel inputs, and 2AM on a Sunday. Design for that 20 percent and you get the automation you need with eyes exactly where they are necessary.
If there is one rule to take from this section: avoid write permissions by default. Make the agent earn them.
Principle 2 — Circuit Breakers
Electricity is about as ubiquitous as anything in modern life. Transformers move power hundreds of kilometres. The machinery, the wires, the infrastructure — all of it has been engineered and improved over decades to get electricity safely into your home. And yet, your breaker panel still has a fuse. A simple, humble circuit breaker. Because all that engineering upstream does not change what happens when something goes wrong inside your house at 3AM.
That circuit breaker has saved millions of lives. Not because the rest of the system failed — but because the designers knew that no matter how good the infrastructure, you need a last line of defence that does not negotiate.
Netflix built the same thing for software. Hystrix was a circuit breaker for microservices — a library that monitored calls between services and, when a dependency started failing, opened the circuit before the failure cascaded. One slow microservice in a distributed system does not stay contained. Without a circuit breaker, it backs up, accumulates, and eventually brings down the whole thing. Hystrix stopped that. It did not fix the failing service — it isolated it, so everything else could keep running.
Agents need the same thing. Not as a nice-to-have. As infrastructure.
What that looks like in practice is an observer system — a separate process, not logic inside the agent itself — that constantly watches the agent’s state graph. Every action the agent takes, every tool call it makes, every token it consumes, every second it runs. The observer has policies. When any policy threshold is breached, it opens the circuit. It does not ask the agent. It does not wait for the next checkpoint. It stops the process, preserves the state, and alerts a human.
The policies themselves depend on the domain — but the categories are consistent: time limits, token budgets, action depth, cost ceilings, and action type restrictions. That last one is the one Amazon needed. A policy that says: agents cannot delete databases, environments, or any configuration that has been put in place by a human. They can create a sandbox. They can never deprecate an existing one. That single policy, enforced by a watchdog, would have made the Amazon incident impossible.
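A sketch of what such a watchdog can look like, in Python. Two assumptions to flag: the agent emits events the observer can inspect from outside, and every name and threshold below is illustrative, not a recommendation.

```python
import time
from dataclasses import dataclass, field

# Actions a human put in place that no agent may undo: the policy Amazon needed.
FORBIDDEN = {"delete_database", "delete_environment", "drop_config"}

@dataclass
class Watchdog:
    """Runs as a separate process from the agent. It does not negotiate."""
    max_seconds: float = 600.0
    max_tokens: int = 200_000
    max_cost_usd: float = 5.0
    max_depth: int = 20
    started: float = field(default_factory=time.monotonic)
    tokens: int = 0
    cost: float = 0.0

    def check(self, event: dict) -> None:
        """Evaluate one agent event against every policy; open the circuit on any breach."""
        self.tokens += event.get("tokens", 0)
        self.cost += event.get("cost_usd", 0.0)
        breaches = []
        if event.get("action") in FORBIDDEN:
            breaches.append(f"forbidden action: {event['action']}")
        if time.monotonic() - self.started > self.max_seconds:
            breaches.append("time limit")
        if self.tokens > self.max_tokens:
            breaches.append("token budget")
        if self.cost > self.max_cost_usd:
            breaches.append("cost ceiling")
        if event.get("depth", 0) > self.max_depth:
            breaches.append("action depth")
        if breaches:
            # Open the circuit: stop the process, preserve the state, alert a human.
            raise RuntimeError("circuit open: " + ", ".join(breaches))
```

The thresholds are not the point. The point is that the check lives outside the agent and fires without asking it.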
The other thing a watchdog gives you — beyond prevention — is diagnosis. When the circuit opens, you have a complete record of everything the agent did up to that point. The state graph, the tool calls, the decisions, the costs. That is your post-mortem. That is how you learn what went wrong, and what policy to add next.
Observe first. Open the circuit when required. Learn from every trip.
Principle 3 — Byzantine Fault Tolerance + Proof of Work
CI/CD changed everything. How we develop, how we deploy, how we observe, how we build. It is one of the most important shifts in software engineering in the past two decades — and I say that as someone who has been in this field for 16 years.
Now imagine CI/CD with coding agent swarms. Not a thought experiment — a reality. Anthropic, OpenAI, Google — all of them are building IDEs and infrastructure for agents that write code, run tests, and deploy autonomously. Almost no human touch, apart from a human describing what they want to solve. For someone who started their career manually merging branches and praying the build did not break, this is genuinely phenomenal.
And genuinely dangerous if you do not design it right.
Here is the failure mode that keeps me up at night. When an agent fails explicitly — it halts, throws an error, requires human intervention, attempts to self-heal — that is actually the good outcome. A loud failure is a contained failure. The agents that fail silently are the problem.
Picture this. An agent in a CI/CD pipeline is waiting on a security audit from another agent before it continues with deployment. The security audit responds with HTTP 200. Everything looks fine. The pipeline continues. Except the security agent silently failed — it could not run the scanner it was supposed to run, returned an empty result with a success status code, and nobody noticed. The deployment went through. The vulnerability went with it.
We have actually solved this problem before. In 1982, Lamport, Shostak, and Pease published the Byzantine Generals Problem. (Leslie Lamport, by the way, explains in the talk linked below why they gave it that name.) Without going into the full depth of it, the question they asked was: how do nodes in a distributed system reach consensus when some of them might be lying, faulty, or actively malicious? The answer they arrived at: you cannot trust any node’s output at face value. You need proof that cannot be faked.
Satoshi Nakamoto borrowed that insight 26 years later. In Bitcoin, you do not trust a miner’s claim that they did the work. You verify the hash. The proof is the trust. You cannot fake a valid hash without redoing the work; the mathematics makes cheating computationally infeasible. That is what made the whole system work.
Agents need the same primitive.
Trust is only by proof — proof of the work that has been done. Without proof, trust no one. That is a policy for distributed systems. It is also, increasingly, a policy for life.
In practice, this means every agent in a system must return not just a verdict but an evidence envelope. The security agent does not return “pass” or “fail.” It returns: lines of code scanned, modules tested, errors found, warnings raised, rules executed, coverage percentage — and then a flag: continue or stop. The receiving agent does not consume the verdict blindly. It evaluates the envelope. If the coverage percentage is zero, it does not matter that the status code was 200. The envelope failed. The circuit stops.
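In code, the envelope is nothing exotic: a structured result the consumer validates before it trusts the verdict. A sketch in Python; the field names are mine, chosen to mirror the list above.

```python
from dataclasses import dataclass

@dataclass
class EvidenceEnvelope:
    verdict: str          # "continue" or "stop"
    lines_scanned: int
    modules_tested: int
    errors_found: int
    warnings_raised: int
    rules_executed: int
    coverage_pct: float

def accept(envelope: EvidenceEnvelope) -> bool:
    """The receiving agent evaluates the proof, not the transport status code."""
    did_real_work = (
        envelope.lines_scanned > 0
        and envelope.rules_executed > 0
        and envelope.coverage_pct > 0.0
    )
    return did_real_work and envelope.errors_found == 0 and envelope.verdict == "continue"

# The silent-failure case from above: HTTP 200, empty results. The envelope fails.
silent_failure = EvidenceEnvelope("continue", 0, 0, 0, 0, 0, 0.0)
assert not accept(silent_failure)
```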
Without that proof of work, agents are running blind. And a CI/CD pipeline full of blind agents is not automation — it is a liability with a deployment button.
Principle 4 — Idempotency
My first real lesson in idempotency came at SAP, on a team called Team Sentinel. We were building an algorithmic trading platform on SAP HANA — an in-memory data store — and I was doing what we would now call ML and data engineering. I was young, new to everything data, and armed primarily with Java and confidence I had not yet earned.
My job was to design a pipeline that pulled end-of-day stock values for 3,000 tickers from Yahoo Finance, running as cronjobs. Inevitably, jobs failed. Rerunning them required manual effort and always produced the same result: thousands of duplicate records. Like anyone sufficiently lazy and sufficiently burned, we eventually built in retries, deduplication, and proper failure handling. But the lesson I took from that experience has stayed with me through every pipeline, every system, and every automation I have built since.
If an operation is not safe to run twice, it is not safe to run automatically.
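The cronjob version of that rule is to key every write so a rerun overwrites instead of duplicating. A minimal sketch with SQLite; the schema is illustrative, not what we ran at SAP.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE eod_prices (
        ticker     TEXT NOT NULL,
        trade_date TEXT NOT NULL,
        close      REAL NOT NULL,
        PRIMARY KEY (ticker, trade_date)  -- the natural key makes reruns harmless
    )
""")

def record_price(ticker: str, trade_date: str, close: float) -> None:
    """Safe to run twice: a retry replaces the row instead of duplicating it."""
    conn.execute(
        "INSERT INTO eod_prices (ticker, trade_date, close) VALUES (?, ?, ?) "
        "ON CONFLICT (ticker, trade_date) DO UPDATE SET close = excluded.close",
        (ticker, trade_date, close),
    )
    conn.commit()

record_price("SAP", "2024-01-05", 139.40)
record_price("SAP", "2024-01-05", 139.40)  # the rerun after a failed cronjob
assert conn.execute("SELECT COUNT(*) FROM eod_prices").fetchone()[0] == 1
```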
We are now in the world of long-running agents. If you have not seen Andrej Karpathy’s autoresearch agent — go look at it. Agents are no longer short, bounded tasks. They are multi-hour, multi-day systems: one agent making thousands of tool calls, or thousands of agents making hundreds of tool calls each, running across many hours, branching, failing, recovering, rerunning. The SAP cronjob problem at civilisational scale.
When these agents fail — and they will fail — the question is not just how to restart them. The question is: what happens to everything they already did? Does a document verification get triggered twice? Does a deployment happen twice? Does a farmer get charged twice for a field inspection that already ran?
The answer depends entirely on whether idempotency was designed in from the start.
What I propose — and what I am actively building into my own projects — is to keep the execution state as a graph stored externally. Every branch of the agent’s execution is tracked outside the agent itself. When a failure occurs and a branch is reinvoked, the system checks the external graph first. If that branch already completed successfully, it is skipped. The work is not repeated. The side effects do not compound.
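A sketch of that check, under two assumptions: each branch gets a stable identity derived from its step name and inputs, and the graph lives in an external store (a plain dict stands in for it here).

```python
import hashlib
import json

# Stand-in for the external store: in a real system this outlives the agent process.
completed: dict[str, dict] = {}

def branch_id(step: str, inputs: dict) -> str:
    """Stable identity: same step plus same inputs means the same branch, across reruns."""
    payload = json.dumps({"step": step, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def run_branch(step: str, inputs: dict, action):
    """Execute a branch at most once; reinvocations return the recorded result."""
    key = branch_id(step, inputs)
    if key in completed:                 # this branch already succeeded on a prior run
        return completed[key]["result"]  # skip it: side effects do not compound
    result = action(inputs)
    completed[key] = {"result": result}  # record success before moving on
    return result

charges = []
def inspect_field(inputs):
    charges.append(inputs)               # stands in for a real side effect: a charge
    return {"status": "inspected"}

run_branch("field_inspection", {"farmer": 42}, inspect_field)
run_branch("field_inspection", {"farmer": 42}, inspect_field)  # rerun after a crash
assert len(charges) == 1                 # the farmer is charged once, not twice
```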
This external execution graph also enables something I find deeply interesting: reinforcement learning on top of agent paths. At any given decision point, an agent can take many paths. The path it took is stored. The paths it did not take are also stored. Over time, you can compare outcomes across paths — which decisions led to better results, which led to failure, which were unnecessarily expensive. You can train agents to make better decisions not just from outcomes, but from the full topology of choices that were available to them.
Idempotency is the foundation that makes all of this possible. Without it, reruns corrupt the data. With it, every run — successful or not — becomes a clean, comparable data point.
I wrote about an early version of this thinking in Git for Thoughts, where I used Git as a version control layer for AI conversations. What I am building now extends that into full Agentic DAGs — execution graphs that are versioned, resumable, and comparable across runs. More on that soon.
The principle is simple, even if the implementation is not: every action an agent takes should be safe to take twice. Build that guarantee in from the start, not after the first incident.
Principle 5 — MAPE-K
A few months ago I was ideating on product strategy at Agreena. There was a business idea I needed to go deeper on — building a sound investment thesis required lateral thinking, different perspectives, people who would push back and bring skills I did not have. It was a week before Christmas. Most of my colleagues were away.
The idea was itching my brain and I had to scratch it.
LLMs have ingested a vast amount of human knowledge and thinking. So I asked: what if I gave them personas and let them talk to each other like peers? I built a multi-agent system — different agents, different personas, different instructions, each aware of the others in the loop and of my role in the conversation. I wanted a peer discussion among equals, not a Q&A with an AI.
It started to work. And then it started to break in an interesting way. The agents, when they had questions, would stop and ask me — regardless of whether the question was consequential or trivial. I had become the bottleneck. Every small ambiguity, every minor clarification, routed back to me. That was not peer discussion. That was a very expensive way to talk to myself.
So I introduced two more agents. An orchestrator, whose job was to keep the peer conversation flowing — making sure agents talked to each other rather than routing everything through me. And an observer, who watched the entire conversation from a distance, tracked questions being asked, monitored the trajectories being explored, and only stopped to bring me in when something consequential required my judgement.
Once that system was in place, I spent four hours in it — going back and forth, watching ideas collide, following threads I would never have pursued alone. I built a completely different take on the strategy. I learned things I did not know I did not know.
That was also the moment I understood something important: AI systems are not just about making us efficient. They are now at a point where they make us effective. There is a difference.
Later, when I went back to understand what I had actually built, I found it had a name. MAPE-K. Monitor, Analyse, Plan, Execute — with a shared Knowledge base underneath all four stages. IBM Research, early 2000s, autonomic computing. The pattern I had independently assembled from necessity had been sitting in the literature for over twenty years.
My observer was Monitor and Analyse. My orchestrator was Plan and Execute. The shared context between them — the running log of questions, trajectories, and decisions — was the Knowledge base.
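Mapped onto code, the loop is small. A toy sketch of the four stages over a shared knowledge base; every name here is mine, not IBM’s.

```python
# The K in MAPE-K: a knowledge base every stage reads from and writes to.
knowledge = {"questions": [], "trajectories": [], "decisions": []}

def monitor(conversation):
    """Observer, stage 1: collect new events from the peer conversation."""
    return [msg for msg in conversation if msg.get("unseen")]

def analyse(events):
    """Observer, stage 2: log everything, flag only the consequential."""
    knowledge["questions"] += [e for e in events if e["type"] == "question"]
    return [e for e in events if e.get("consequential")]

def plan(flagged):
    """Orchestrator, stage 1: escalate the consequential, keep the rest between peers."""
    if flagged:
        return [{"action": "escalate_to_human", "event": e} for e in flagged]
    return [{"action": "route_between_peers"}]

def execute(steps):
    """Orchestrator, stage 2: act, and record every decision in the knowledge base."""
    knowledge["decisions"] += steps

events = monitor([
    {"unseen": True, "type": "question", "consequential": False},  # peers resolve this
    {"unseen": True, "type": "question", "consequential": True},   # this one reaches me
])
execute(plan(analyse(events)))
```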
Long-running agents need this. I believe that now with some conviction.
Had Amazon designed their agent process with a MAPE-K approach, the observer layer would have caught the plan before it executed. Not just the action — the intent. An orchestrator reviewing a plan that includes “delete and recreate the environment” would have flagged it before a single command ran. The circuit breaker would have had something to trip on.
In an agent swarm with thousands of things happening simultaneously, designing without MAPE-K is not just a reliability risk. It is a trust risk. Every catastrophic failure, every wasted run, every system that does the wrong thing confidently — it erodes the belief that AI systems can actually deliver on what they promise. And that belief, once lost, is hard to earn back.
Trust No Node
Things are changing at breakneck speed. The way we build, the way we run systems, the way we think about what software can do — all of it, faster than any previous shift I have seen in sixteen years of doing this.
I have always asked myself: what matters more, speed or value? With AI, we are starting to see both at the same time. That is genuinely extraordinary. And it is also exactly when discipline matters most — because the faster you move, the more damage a wrong turn causes.
I do not read the Amazon internal note as a cautionary tale about moving too fast. I read it as a case study in what to improve. Their instinct — slow down, add a human — is understandable. It is also not the answer. A senior engineer signing off on every AI-assisted push does not scale. It is a bottleneck wearing a hard hat, and it will slow Amazon down without making their systems meaningfully safer.
The answer is to build the right foundation. Five principles. Thirty to forty years of distributed systems wisdom, applied to agents:
Blast radius — know the worst before you grant the permission.
Circuit breakers — a system that cannot stop itself is not a system, it is a liability.
Byzantine fault tolerance — trust the proof, not the status code.
Idempotency — if it is not safe to run twice, it is not safe to run automatically.
MAPE-K — monitor, analyse, plan, execute. Let the system heal itself before it needs a human.
Our learning loops need to be faster. Better. The speed that AI gives us is only as valuable as the reliability underneath it. Just because we can move fast does not mean we should forget everything we learned about building machines that last.
This is the time to forge it all together — the new and the proven — and do what we humans do best.
Create.
Trust no node. Trust is built. Trust is designed.