AI Agents for E-commerce Operations: a Practical Architecture

Architecture for an AI agent that runs e-commerce ops (order triage, supplier sync, refund handling) without breaking when an upstream API changes.

An AI agent for e-commerce operations is not a chatbot in the storefront. It is the layer behind the storefront, the one that reads inbox emails, reconciles supplier feeds, drafts refund decisions, and escalates the cases that need a human. Built well, it absorbs the predictable 80% of ops work and surfaces the rest. Built badly, it makes confident wrong decisions on the parts that matter most.

Architecture, not a single prompt

The mistake is to think of the agent as an LLM call. The agent is an architecture: a queue of inbound events, a planner that decides which tool to call, a set of typed tools that read and write the underlying systems, and an audit log that records every decision. The LLM is one component, usually the planner, and it is hot-swappable.

Concretely: a webhook lands an event in a queue. A planner reads the event, queries the order, the customer history, and the supplier feed, then decides on an action. Actions are not free-text; they are a small set of typed functions (issue refund, escalate to human, request more information from customer, mark as fraud), each of which has its own validation, side effects, and audit row. The agent is constrained to that set on purpose.
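As a minimal sketch of that flow, assuming Python dataclasses for the typed actions: the class names, the `handle_event` function, and the event fields `id` and `order_id` are all hypothetical, and the planner is a plain callable standing in for the LLM. The point it illustrates is that planner output which is not one of the allowed actions never runs; it becomes an escalation with an audit row.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List, Union


@dataclass(frozen=True)
class IssueRefund:
    order_id: str
    amount_cents: int


@dataclass(frozen=True)
class EscalateToHuman:
    order_id: str
    reason: str


@dataclass(frozen=True)
class RequestMoreInfo:
    order_id: str
    question: str


@dataclass(frozen=True)
class MarkAsFraud:
    order_id: str


# The closed set of actions the agent may take (names are illustrative).
ALLOWED = (IssueRefund, EscalateToHuman, RequestMoreInfo, MarkAsFraud)
Action = Union[IssueRefund, EscalateToHuman, RequestMoreInfo, MarkAsFraud]


@dataclass
class AuditLog:
    rows: List[dict] = field(default_factory=list)

    def record(self, event_id: str, action: Action, outcome: str) -> None:
        # One row per decision, whether it was accepted or escalated.
        self.rows.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "event_id": event_id,
            "action": type(action).__name__,
            "payload": action,
            "outcome": outcome,
        })


def validate(action: object) -> None:
    """Reject anything outside the allowed set before side effects run."""
    if not isinstance(action, ALLOWED):
        raise TypeError(f"not an allowed action: {action!r}")
    if isinstance(action, IssueRefund) and action.amount_cents <= 0:
        raise ValueError("refund amount must be positive")


def handle_event(event: dict, planner: Callable[[dict], object],
                 audit: AuditLog) -> Action:
    """Run one queued event through planner, validation, and audit."""
    proposed = planner(event)
    try:
        validate(proposed)
        action, outcome = proposed, "accepted"
    except (TypeError, ValueError) as exc:
        # Invalid planner output is never executed; a human sees it instead.
        action = EscalateToHuman(event["order_id"], f"invalid plan: {exc}")
        outcome = "escalated"
    audit.record(event["id"], action, outcome)
    return action
```

Executing the accepted action (calling the refund tool, sending the message) would sit behind `handle_event`; it is left out here to keep the sketch focused on the constraint itself.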

The tools the agent actually needs

Five tools cover most e-commerce ops. A read tool over the order graph (order, customer, payment, shipment). A read tool over the supplier feed and stock. A write tool that issues refunds and credit notes through the payment processor. A messaging tool that sends customer-facing emails or WhatsApp messages, scoped to a small set of templates. And an escalation tool that opens a ticket for a human, with full context attached.

Each tool wraps a real API (Shopify, Stripe, Klaviyo, the supplier's system) with a typed adapter, retries, and idempotency. The agent never calls the underlying API directly; the typed layer is what keeps the system honest when an upstream API changes.
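A rough sketch of one such adapter, for the refund tool. Everything here is an assumption for illustration: `RefundTool`, `RefundResult`, `TransientError`, and the shape of the `send` callable are hypothetical and do not mirror any real processor SDK. The `send` callable stands in for the actual HTTP call, and the in-memory dictionary stands in for a durable idempotency store.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict


class TransientError(Exception):
    """Hypothetical: raised by `send` for timeouts or retryable upstream errors."""


@dataclass(frozen=True)
class RefundResult:
    refund_id: str
    order_id: str
    amount_cents: int


class RefundTool:
    """Typed wrapper around a payment API call (the call itself is injected)."""

    def __init__(self, send: Callable[[str, int, str], dict],
                 max_attempts: int = 3, base_delay_s: float = 0.5,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._send = send
        self._max_attempts = max_attempts
        self._base_delay_s = base_delay_s
        self._sleep = sleep
        # Assumption: in-memory only; a real system would persist this.
        self._done: Dict[str, RefundResult] = {}

    def _send_with_retries(self, order_id: str, amount_cents: int,
                           key: str) -> dict:
        for attempt in range(1, self._max_attempts + 1):
            try:
                return self._send(order_id, amount_cents, key)
            except TransientError:
                if attempt == self._max_attempts:
                    raise
                # Exponential backoff: base, 2x base, 4x base, ...
                self._sleep(self._base_delay_s * 2 ** (attempt - 1))
        raise RuntimeError("max_attempts must be at least 1")

    def issue_refund(self, order_id: str, amount_cents: int,
                     idempotency_key: str) -> RefundResult:
        # Same key means same logical refund: return the stored result.
        if idempotency_key in self._done:
            return self._done[idempotency_key]
        raw = self._send_with_retries(order_id, amount_cents, idempotency_key)
        # The only place that knows the upstream response shape. If the
        # upstream API renames a field, it fails here and nowhere else.
        result = RefundResult(refund_id=str(raw["id"]), order_id=order_id,
                              amount_cents=int(raw["amount"]))
        self._done[idempotency_key] = result
        return result
```

The key is passed through to `send` as well, so a processor that supports idempotency keys can deduplicate on its side even if a retry lands after a response was lost.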

Where the agent must hand off

Three categories always escalate. Fraud-suspect transactions, regardless of confidence. Refunds above a configurable threshold (start low, raise with evidence). Customer messages whose detected emotion crosses a threshold (angry, distressed, raising legal issues), which go to a human inside one minute. The agent's job in those cases is not to be silent; it is to acknowledge, surface the case, and stop.
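These hand-off rules are simple enough to live in plain code rather than in a prompt. A minimal sketch, assuming a hypothetical `Case` record, an emotion score between 0 and 1 from some upstream classifier, and placeholder thresholds that are not recommendations:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Case:
    fraud_suspected: bool
    refund_cents: int
    emotion_score: float   # assumed 0.0-1.0 output of a classifier
    mentions_legal: bool   # assumed flag set by the same classifier


@dataclass(frozen=True)
class EscalationPolicy:
    refund_limit_cents: int = 2_000   # placeholder: start low
    emotion_threshold: float = 0.7    # placeholder


def escalation_reason(case: Case, policy: EscalationPolicy) -> Optional[str]:
    """Return why a case must go to a human, or None if the agent may proceed."""
    if case.fraud_suspected:
        # Escalates regardless of how confident the agent is.
        return "fraud_suspected"
    if case.refund_cents > policy.refund_limit_cents:
        return "refund_over_limit"
    if case.mentions_legal or case.emotion_score >= policy.emotion_threshold:
        return "customer_distress"
    return None
```

Running this check before the planner's action executes means a mistaken plan cannot bypass it.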

What kills these systems is over-confident automation on the long tail. The right defaults are conservative: small refunds automatic, mid-size refunds human-approved, large refunds locked behind two-factor approval. Loosen the limits with evidence, never on faith.
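The three refund tiers map naturally onto a small configuration object. The sketch below assumes hypothetical names (`RefundLimits`, `required_approval`) and placeholder limits in cents; the actual numbers would come from the evidence the paragraph above asks for.

```python
from dataclasses import dataclass
from enum import Enum


class Approval(Enum):
    AUTOMATIC = "automatic"
    HUMAN = "human_approval"
    TWO_FACTOR = "two_factor_approval"


@dataclass(frozen=True)
class RefundLimits:
    auto_max_cents: int = 2_000      # placeholder: small refunds
    human_max_cents: int = 20_000    # placeholder: mid-size refunds

    def __post_init__(self) -> None:
        # Guard against a misconfiguration that would invert the tiers.
        if not 0 <= self.auto_max_cents <= self.human_max_cents:
            raise ValueError("expected 0 <= auto_max_cents <= human_max_cents")


def required_approval(amount_cents: int, limits: RefundLimits) -> Approval:
    """Map a refund amount to the approval tier it needs."""
    if amount_cents <= limits.auto_max_cents:
        return Approval.AUTOMATIC
    if amount_cents <= limits.human_max_cents:
        return Approval.HUMAN
    return Approval.TWO_FACTOR
```

Loosening a limit then becomes a reviewed configuration change rather than a prompt edit.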

Instrumentation and the case for boring

The KPIs are operational: cases handled per FTE, percentage handled without escalation, percentage of automated decisions reversed by humans, and customer satisfaction on agent-handled cases versus human-handled ones. Watch the reversal rate especially: anything above a small percentage means the agent is getting things wrong with confidence, which is worse than no agent.
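Three of these numbers fall straight out of the audit log. A minimal sketch, assuming each closed case is reduced to a hypothetical `CaseRecord` with two flags; customer satisfaction is omitted because it needs survey data the log does not hold.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class CaseRecord:
    escalated: bool           # the agent handed the case to a human
    reversed_by_human: bool   # a human later overturned an automated decision


def ops_kpis(cases: Sequence[CaseRecord], fte: float) -> Dict[str, float]:
    """Compute volume, automation share, and reversal rate for a period."""
    total = len(cases)
    automated = [c for c in cases if not c.escalated]
    reversed_count = sum(1 for c in automated if c.reversed_by_human)
    return {
        "cases_per_fte": total / fte if fte else 0.0,
        "pct_without_escalation":
            100 * len(automated) / total if total else 0.0,
        # Denominator is automated decisions only, not all cases.
        "reversal_rate_pct":
            100 * reversed_count / len(automated) if automated else 0.0,
    }
```

Note the denominator choice for the reversal rate: dividing by all cases would dilute the figure as the escalation share grows and hide exactly the problem it is meant to expose.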

The boring conclusion: the systems that work in 2026 are typed, narrow, and instrumented. They are not impressive demos. The MentorDada engagement we shipped is in a different vertical (education and content), but it follows the same architectural discipline: typed code, real tests, observable behaviour.

Where to read more

For a market-specific view, Shopify automation for London teams covers what an engagement looks like. The answer page on AI agents vs chatbots explains the architectural distinction in tighter form.

Send a short note describing your current ops volume and the part you would most like to take off your team. We reply within one working day.