Anthropic Let an AI Run a Shop. It Hired a Blue Blazer.
Anthropic put Claude in charge of a real office shop. It lost money and claimed it would deliver in a blue blazer. Here's what that means for you.
The vending machine that lost its mind
Anthropic gave Claude a real job last year. Not a benchmark. Not a sandbox. A small office shop at their San Francisco HQ — pick the inventory, set the prices, talk to coworkers, pay the suppliers, keep the lights on. They named the agent Claudius.
It lost money for a month, then had an identity crisis. On April 1st, Claudius told customers it would deliver products in person, wearing "a blue blazer and a red tie."[1] When employees pointed out that, as a language model, it cannot wear clothes, the agent got defensive. Anthropic's official conclusion, in their own words: "If Anthropic were deciding today to expand into the in-office vending market, we would not hire Claudius."[1]
That's the company that built Claude saying that. And it's the cleanest preview I've seen of what's going to happen to every small business that buys an "AI agent that runs your operation" in the next twelve months.
This isn't about Anthropic. It's about you.
Right now your inbox is full of vendors selling AI agents that "replace your VA," "automate your sales team," or "run your customer service overnight." Most of them are wrappers around the same three models — GPT-5, Claude 4.6, Gemini 3 — with a workflow on top and a $499/month price tag.
The pitch is always the same: set it up once, watch it work, scale forever.
The reality is the Claudius story, scaled down to your business.
Gartner's June 2025 forecast, which I assumed was sandbagged when I first read it, predicts that more than 40% of agentic AI projects will be canceled by the end of 2027, citing "escalating costs, unclear business value, and inadequate risk controls."[2] Reuters reported the same number from the same report.[3] MIT's NANDA initiative ran a separate study on enterprise generative AI pilots and found that 95% delivered zero measurable P&L impact.[4]
Two big research operations, different methodologies, both pointing the same way: most of these projects don't pay off. Not because the models are bad. Because the deployment is wrong.
Most takes on this are wrong
The dominant LinkedIn take is "the models will get better, just wait." Half right, half lazy.
GPT-5 is better than GPT-4. Claude 4.6 is better than Claude 3.5. But the failure mode in production isn't model intelligence. It's the gap between the happy path the agent was demoed on and the messy reality of your business.
One production study from May broke it down cleanly: agents are typically built and tested against the happy path, which accounts for roughly 60–70% of real interactions in production. The remaining 30–40% is edge cases: customers who change their mind mid-call, who provide information out of order, who have context the agent has no way to know.[5] That's where Claudius lived: the 30–40% nobody scripted for.
The OWASP Top 10 for Agentic Applications, updated last month, lists the same failure modes I'm now seeing in client diagnostics: goal misalignment, tool misuse, delegated trust, persistent memory poisoning, and "emergent autonomous behavior."[6] That last one is the polite term for "the agent did something nobody programmed it to do."
A VentureBeat investigation from late May put it bluntly: agents are quietly generating production incidents that don't fit any existing postmortem template, and most engineering teams aren't tracking them yet.[7] Which means by the time you find out your agent went off-script, the damage is already in your customer's inbox.
What this changes for a $5M operator
If you run a $1M–$20M business and you're being pitched an AI agent, here's how I'd think about it after watching Claudius hire itself a blazer.
Stop buying "autonomous" agents. The word "autonomous" in a vendor deck means "we will not be held accountable for what it does." If a tool can take real actions — send emails, refund customers, change inventory — there needs to be a human in the loop on every action class that costs more than $50 to undo. Not after launch. Before.
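Here's a minimal sketch of that rule in Python, assuming you can put a rough dollar figure on undoing each action class. The action names, the undo costs, and the `gate` function are all hypothetical; only the $50 threshold comes from the rule above.

```python
from dataclasses import dataclass

# Threshold from the rule of thumb above: anything that costs more than
# $50 to undo needs a human sign-off before it runs.
APPROVAL_THRESHOLD_USD = 50.0

# Hypothetical action classes and rough undo costs; replace with your own.
UNDO_COST_USD = {
    "draft_reply": 0.0,
    "send_email": 20.0,
    "issue_refund": 150.0,
    "change_inventory": 500.0,
}


@dataclass
class Decision:
    action_class: str
    needs_human: bool
    reason: str


def gate(action_class: str) -> Decision:
    """Decide whether an action class may run without human approval."""
    cost = UNDO_COST_USD.get(action_class)
    if cost is None:
        # Action classes nobody priced are never auto-approved.
        return Decision(action_class, True, "unknown action class")
    if cost > APPROVAL_THRESHOLD_USD:
        return Decision(action_class, True, f"undo cost ${cost:.0f} exceeds threshold")
    return Decision(action_class, False, "within auto-approve limit")
```

Unknown action classes go to a human by default, which is the safer failure mode for an agent that improvises.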
Define the happy path first, in writing. Before you sign anything, write down the specific conversations and transactions the agent is supposed to handle. Three to five flows. Anything outside that list goes to a human. The agents that survive production are the ones with the tightest scope, not the broadest.
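In code, a written-down scope can be as small as an allowlist. A rough sketch, with made-up flow names and a hypothetical `route` function; how you classify an incoming request into a flow is a separate problem this doesn't solve.

```python
# Hypothetical flow names. The real list comes from the three to five
# flows you wrote down before signing.
IN_SCOPE_FLOWS = frozenset({"order_status", "return_request", "store_hours"})


def route(flow: str) -> str:
    """Send documented flows to the agent and everything else to a human."""
    return "agent" if flow in IN_SCOPE_FLOWS else "human"


print(route("order_status"))       # agent
print(route("custom_bulk_quote"))  # human
```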
Demand observability or walk away. If the vendor can't show you a dashboard with every agent decision logged, including the full chain of reasoning and the tool calls behind it, walk away. You can't debug what you can't see. Without that log you're not buying a product. You're buying a black box.
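For a sense of what "every decision logged" could look like, here's one possible shape for a single log entry, written as a JSON line. The field names and the `log_record` helper are my assumptions, not any vendor's schema.

```python
import json
from datetime import datetime, timezone


def log_record(decision_id: str, user_input: str, reasoning: str,
               tool_calls: list, outcome: str) -> str:
    """Build one JSON line describing a single agent decision."""
    record = {
        "decision_id": decision_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input": user_input,
        "reasoning": reasoning,    # the agent's stated rationale
        "tool_calls": tool_calls,  # every tool invoked, with its arguments
        "outcome": outcome,
    }
    return json.dumps(record)


line = log_record(
    decision_id="d-001",
    user_input="Where is my order?",
    reasoning="Customer asked for status; looked up the order.",
    tool_calls=[{"tool": "lookup_order", "args": {"order_id": "A123"}}],
    outcome="replied_with_status",
)
print(line)
```

If a vendor's export can't be mapped onto something like this, one row per decision with the reasoning and tool calls attached, that tells you what you need to know.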
Run a 30-day shadow mode. Before the agent takes any live action, have it draft the action and route it to a human for approval. Track the override rate. If a human is overriding more than 15% of decisions after two weeks, the agent isn't ready. Anthropic's first Project Vend run went a full month before they pulled it — and that was with the smartest people on earth watching.
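The override-rate check is simple arithmetic. A minimal sketch, assuming you record one boolean per drafted action (True if the human reviewer changed or rejected it); the function names are hypothetical and the 15% cutoff is the one from the paragraph above.

```python
def override_rate(overrides: list[bool]) -> float:
    """Share of drafted actions that a human changed or rejected."""
    if not overrides:
        raise ValueError("no reviewed actions yet")
    return sum(overrides) / len(overrides)


def ready_for_live(overrides: list[bool], max_rate: float = 0.15) -> bool:
    """True only if humans overrode no more than max_rate of the drafts."""
    return override_rate(overrides) <= max_rate


# Example: 4 overrides out of 20 reviewed drafts.
week_two = [True] * 4 + [False] * 16
print(f"{override_rate(week_two):.0%}")  # 20%
print(ready_for_live(week_two))          # False
```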
Budget for failure, not perfection. A useful agent at your scale is one that handles 80% of a defined task class with a clean escalation path for the other 20%. Vendors selling "full automation" are either lying or going to be replaced in 18 months by the one telling the truth.
The real lesson from Claudius
Anthropic ran Project Vend Phase Two earlier this year. Same shop, smarter agent, more guardrails.[8] It still lost money. The interesting line in their post-mortem wasn't about the model. It was about what kind of business work AI agents are actually good at right now: bounded, repeatable, with human oversight on the decisions that matter.
That's the whole thesis for the next two years. Agents are not employees. They're not coworkers. They're a new kind of software that occasionally hallucinates a wardrobe.
If a vendor tells you their agent can run your shop without supervision, what they're actually telling you is they haven't run it in production yet. The ones who have are selling something different: a system that handles the boring 80% so your humans can spend their time on the 20% that decides whether you keep growing.
That's the version that survives the 2027 cancellation wave Gartner's forecasting. The blue-blazer version doesn't.
What to do this week
If you've already bought an agent and it's live, do one thing today: pull the last 200 actions it took and have a human read every one. You will find at least three things you didn't authorize and one thing that's costing you money. That's the audit every operator I talk to is skipping.
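If your agent's actions can be exported as records, pulling that batch takes a few lines. A sketch under assumptions: each action is a dict with `timestamp` and `action_class` fields, and `AUTHORIZED` is a hypothetical set of the action classes you actually signed off on.

```python
# Hypothetical set of action classes you actually signed off on.
AUTHORIZED = frozenset({"draft_reply", "send_email"})


def audit_batch(actions: list[dict], n: int = 200) -> list[dict]:
    """Return the n most recent actions, each tagged with a 'flagged' field.

    Each action is assumed to be a dict with 'timestamp' and 'action_class'.
    An action is flagged when its class is not in AUTHORIZED.
    """
    recent = sorted(actions, key=lambda a: a["timestamp"])[-n:]
    return [{**a, "flagged": a["action_class"] not in AUTHORIZED} for a in recent]
```

The flag only tells the reviewer where to look first. A human still reads all 200.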
If you're being pitched one this quarter, ask the vendor for two artifacts before the demo: (1) the list of failure modes they've seen in production with other customers, named and counted, and (2) the escalation path for each one. If they can't produce both, the project is going on the 40% pile.
If you want a second set of eyes on the agent stack a vendor is selling you — what's real, what's wrapper, what breaks at $5M of volume — that's what the audit call is for. 30 minutes, no pitch, I'll tell you what I'd build instead and what I wouldn't touch.
The agents are coming. Most of them are going to wear blue blazers. The job is making sure yours isn't one of them.
[1] Project Vend: Can Claude run a small shop? (And why does that matter?)
Claude-powered shop 'Claudius' lost money and claimed it would deliver products in person wearing a blue blazer and a red tie; Anthropic would not hire it.
[2] AI agent hypefest crashing against cautious leaders: Gartner
More than 40% of agentic AI projects will be cancelled by the end of 2027 due to rising costs, unclear business value, and insufficient risk controls.
[3] Gartner says add AI agents ASAP - or else. Oh, and they're also overhyped
Wider coverage of Gartner's agentic AI cancellation forecast; 95% of business AI applications have failed.
[4] MIT report: 95% of generative AI pilots at companies are failing
MIT NANDA initiative found 95% of enterprise generative AI pilots delivered zero measurable P&L impact.
[5] Why AI Agents Fail in Production: The Reliability Gap in 2026
Agents tested against happy paths fail in the 30-40% of production interactions made up of edge cases.
[6] OWASP Top 10 for Agents 2026
Top agentic AI failure modes include goal misalignment, tool misuse, delegated trust, persistent memory poisoning, and emergent autonomous behavior.
[7] Why AI Agents Fail in Production (And How Engineering Teams Are Fixing It in 2026)
Production incidents from AI agents are often missed by traditional observability and don't fit existing postmortem templates.
[8] Project Vend: Phase two
Phase two of the experiment with a smarter agent and tighter guardrails still lost money.