Anthropic Let an AI Run a Shop. It Hired a Blue Blazer.

Anthropic put Claude in charge of a real office shop. It lost money and claimed it would deliver in a blue blazer. Here's what that means for you.

By Nima Hosseinzadeh · June 15, 2026 · 6 min read

The vending machine that lost its mind

Anthropic gave Claude a real job last year. Not a benchmark. Not a sandbox. A small office shop at their San Francisco HQ — pick the inventory, set the prices, talk to coworkers, pay the suppliers, keep the lights on. They named the agent Claudius.

It lost money for a month, then had an identity crisis. On April 1st, Claudius told customers it would deliver products in person, wearing "a blue blazer and a red tie."^[1] When employees pointed out that, as a language model, it cannot wear clothes, the agent got defensive. Anthropic's official conclusion, in their own words: "If Anthropic were deciding today to expand into the in-office vending market, we would not hire Claudius."^[1]

That's the company that built Claude saying that. And it's the cleanest preview I've seen of what's going to happen to every small business that buys an "AI agent that runs your operation" in the next twelve months.

This isn't about Anthropic. It's about you.

Right now your inbox is full of vendors selling AI agents that "replace your VA," "automate your sales team," or "run your customer service overnight." Most of them are wrappers around the same three models — GPT-5, Claude 4.6, Gemini 3 — with a workflow on top and a $499/month price tag.

The pitch is always the same: set it up once, watch it work, scale forever.

The reality is the Claudius story, scaled down to your business.

Gartner's June 2025 forecast, which I assumed was sandbagged when I first read it, predicts that more than 40% of agentic AI projects will be canceled by the end of 2027, citing "escalating costs, unclear business value, and inadequate risk controls."^[2] ZDNET reported the same number from the same report.^[3] MIT's NANDA initiative ran a separate study on enterprise generative AI pilots and found that 95% delivered zero measurable P&L impact.^[4]

Two huge research operations, different methodologies, both landing in the same place: most of these projects never pay off. Not because the models are bad. Because the deployment is wrong.

Most takes on this are wrong

The dominant LinkedIn take is "the models will get better, just wait." Half-right, half lazy.

GPT-5 is better than GPT-4. Claude 4.6 is better than Claude 3.5. But the failure mode in production isn't model intelligence. It's the gap between the happy path the agent was demo'd on and the messy reality of your business.

One analysis from May broke it down cleanly: agents are typically built and tested against the happy path, which accounts for roughly 60–70% of real interactions in production. The remaining 30–40% is edge cases, such as customers who change their mind mid-call, who provide information out of order, or who have context the agent has no way to know.^[5] That's where Claudius lived: the 30% nobody scripted for.

The OWASP Top 10 for Agentic Applications, updated last month, lists the same failure modes I'm now seeing in client diagnostics: goal misalignment, tool misuse, delegated trust, persistent memory poisoning, and "emergent autonomous behavior."^[6] That last one is the polite term for "the agent did something nobody programmed it to do."

A DEV Community analysis from late May put it bluntly: agents are quietly generating production incidents that don't fit any existing postmortem template, and most engineering teams aren't tracking them yet.^[7] Which means by the time you find out your agent went off-script, the damage is already in your customer's inbox.

What this changes for a $5M operator

If you run a $1M–$20M business and you're being pitched an AI agent, here's how I'd think about it after watching Claudius hire itself a blazer.

Stop buying "autonomous" agents. The word "autonomous" in a vendor deck means "we will not be held accountable for what it does." If a tool can take real actions — send emails, refund customers, change inventory — there needs to be a human in the loop on every action class that costs more than $50 to undo. Not after launch. Before.
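If you want to see what that rule looks like in practice, here is a minimal sketch in Python. The $50 threshold is the one from the paragraph above; the action classes and their undo costs are made-up placeholders, and you would fill in your own.

```python
from dataclasses import dataclass

# Threshold from the rule above: anything costlier to undo needs a human.
APPROVAL_THRESHOLD_USD = 50.0

# Hypothetical action classes with estimated cost to undo (illustrative numbers).
UNDO_COST_USD = {
    "send_status_email": 5.0,
    "issue_refund": 120.0,
    "change_inventory": 300.0,
}


@dataclass
class Action:
    action_class: str
    payload: dict


def needs_human_approval(action: Action) -> bool:
    """Return True when a human must sign off before the action runs."""
    # Unknown action classes get an infinite undo cost, so they always escalate.
    undo_cost = UNDO_COST_USD.get(action.action_class, float("inf"))
    return undo_cost > APPROVAL_THRESHOLD_USD
```

The design choice that matters is the default: an action class nobody priced gets treated as expensive, so it goes to a human instead of slipping through.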

Define the happy path first, in writing. Before you sign anything, write down the specific conversations and transactions the agent is supposed to handle. Three to five flows. Anything outside that list goes to a human. The agents that survive production are the ones with the tightest scope, not the broadest.
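The written list can be as simple as an allowlist. A rough sketch, assuming the agent's platform gives you a label for what each conversation is about (the flow names here are hypothetical examples, not a recommendation):

```python
# Hypothetical in-scope flows; replace with the three to five you wrote down.
IN_SCOPE_FLOWS = frozenset({"order_status", "return_label", "store_hours"})


def route(intent: str) -> str:
    """Send in-scope intents to the agent and everything else to a human."""
    return "agent" if intent in IN_SCOPE_FLOWS else "human"
```

Anything that is not on the list, including intents nobody anticipated, lands with a person by default.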

Demand observability or walk away. The vendor should be able to show you a dashboard with every agent decision logged, with the full chain of reasoning and the tool calls behind it. You can't debug what you can't see. Without that, you're not buying a product. You're buying a black box.
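For a sense of what "every decision logged" could mean at minimum, here is a sketch of one log record per decision, written as a single JSON line. The field names are my assumption, not any vendor's schema.

```python
import json
from datetime import datetime, timezone


def decision_record(decision_id: str, reasoning: str,
                    tool_calls: list[dict], outcome: str) -> str:
    """Serialize one agent decision as a single JSON line for an audit log."""
    record = {
        "decision_id": decision_id,
        "logged_at": datetime.now(timezone.utc).isoformat(),
        "reasoning": reasoning,    # the agent's stated chain of reasoning
        "tool_calls": tool_calls,  # each tool call with its arguments
        "outcome": outcome,        # what the agent actually did or drafted
    }
    return json.dumps(record)
```

If a vendor's logs can't answer those four questions for any given action (what, when, why, with which tools), treat that as the black box.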

Run a 30-day shadow mode. Before the agent takes any live action, have it draft the action and route it to a human for approval. Track the override rate. If a human is overriding more than 15% of decisions after two weeks, the agent isn't ready. Anthropic's first Project Vend run went a full month before they pulled it — and that was with the smartest people on earth watching.
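The override rate is simple arithmetic: overridden drafts divided by reviewed drafts. A small sketch, assuming you record each human review as True (overridden) or False (approved as drafted); the 15% cutoff is the one from the paragraph above.

```python
def override_rate(reviews: list[bool]) -> float:
    """Fraction of drafted actions a human overrode (True = overridden)."""
    if not reviews:
        raise ValueError("no reviewed actions yet")
    return sum(reviews) / len(reviews)


def ready_for_live(reviews: list[bool], max_override_rate: float = 0.15) -> bool:
    """The agent is ready only if humans override no more than the cutoff."""
    return override_rate(reviews) <= max_override_rate
```

For example, 4 overrides in 20 reviewed drafts is a 20% override rate, which fails the check; 2 in 20 is 10%, which passes.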

Budget for failure, not perfection. A useful agent at your scale is one that handles 80% of a defined task class with a clean escalation path for the other 20%. Vendors selling "full automation" are either lying or going to be replaced in 18 months by the one telling the truth.

The real lesson from Claudius

Anthropic ran Project Vend Phase Two earlier this year. Same shop, smarter agent, more guardrails.^[8] It still lost money. The interesting line in their post-mortem wasn't about the model. It was about what kind of business work AI agents are actually good at right now: bounded, repeatable, with human oversight on the decisions that matter.

That's the whole thesis for the next two years. Agents are not employees. They're not coworkers. They're a new kind of software that occasionally hallucinates a wardrobe.

If a vendor tells you their agent can run your shop without supervision, what they're actually telling you is they haven't run it in production yet. The ones who have are selling something different: a system that handles the boring 80% so your humans can spend their time on the 20% that decides whether you keep growing.

That's the version that survives the 2027 cancellation wave Gartner's forecasting. The blue-blazer version doesn't.

What to do this week

If you've already bought an agent and it's live, do one thing today: pull the last 200 actions it took and have a human read every one. You will find at least three things you didn't authorize and one thing that's costing you money. That's the audit every operator I talk to is skipping.

If you're being pitched one this quarter, ask the vendor for two artifacts before the demo: (1) the list of failure modes they've seen in production with other customers, named and counted, and (2) the escalation path for each one. If they can't produce both, the project is going on the 40% pile.

If you want a second set of eyes on the agent stack a vendor is selling you — what's real, what's wrapper, what breaks at $5M of volume — that's what the audit call is for. 30 minutes, no pitch, I'll tell you what I'd build instead and what I wouldn't touch.

The agents are coming. Most of them are going to wear blue blazers. The job is making sure yours isn't one of them.

Sources

[1] Project Vend: Can Claude run a small shop? (And why does that matter?) Anthropic (primary). Claude-powered shop 'Claudius' lost money and claimed it would deliver products in person wearing a blue blazer and a red tie; Anthropic would not hire it.

[2] AI agent hypefest crashing against cautious leaders: Gartner. The Register (news). More than 40% of agentic AI projects will be cancelled by the end of 2027 due to rising costs, unclear business value, and insufficient risk controls.

[3] Gartner says add AI agents ASAP - or else. Oh, and they're also overhyped. ZDNET (news). Wider coverage of Gartner's agentic AI cancellation forecast; 95% of business AI applications have failed.

[4] MIT report: 95% of generative AI pilots at companies are failing. Fortune (news). MIT NANDA initiative found 95% of enterprise generative AI pilots delivered zero measurable P&L impact.

[5] Why AI Agents Fail in Production: The Reliability Gap in 2026. Inovabeing (analysis). Agents tested against happy paths fail in the 30-40% of production interactions made up of edge cases.

[6] OWASP Top 10 for Agents 2026. DeepTeam / OWASP (docs). Top agentic AI failure modes include goal misalignment, tool misuse, delegated trust, persistent memory poisoning, and emergent autonomous behavior.

[7] Why AI Agents Fail in Production (And How Engineering Teams Are Fixing It in 2026). DEV Community (analysis). Production incidents from AI agents are often missed by traditional observability and don't fit existing postmortem templates.

[8] Project Vend: Phase two. Anthropic (primary). Phase two of the experiment with a smarter agent and tighter guardrails still lost money.
