The Practical State of AI Agents in Mid-2024
The agent conversation has moved from hype to deployment in some categories and remains hype in others. A practitioner snapshot of where agents are actually working and where they are still demos.
The AI agent conversation in mid-2024 is louder than ever. Frameworks proliferate. Demos are striking. The case for "agents will change enterprise work" is made in every keynote. The picture in production is more nuanced: agents are working in some categories, are not working in others, and the dividing line is becoming clearer.
This is a practitioner snapshot of where AI agents are actually shipping in enterprise contexts in mid-2024 — what is working, what is not, and what the bridge to broader adoption actually requires.
Where agents are working
The categories where production agent deployments are showing real value:
Code agents in development workflows
The most successful agent category to date. The agent operates inside the development environment, with access to the codebase, tests, and execution. Tasks like "implement this feature based on this issue" or "fix this failing test" produce useful results often enough to be worth the cost.
What makes it work:
- Bounded scope — the agent works on a specific issue or task, not on open-ended objectives
- Fast feedback — tests run; outputs are validated; errors are immediate
- Human review at the PR — the agent's output is reviewed before merge
- Strong existing tooling — the development environment is rich with feedback signals
The result: agents that contribute usefully to enumerated tasks, with human review as the quality gate.
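The shape described above can be sketched as a small loop: one issue, a capped number of attempts, test output fed back into the next attempt, and a hand-off when the cap is reached. This is a minimal sketch, not any particular framework's API; `propose_patch`, `run_tests`, and the two stand-in functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Outcome:
    status: str    # "ready_for_review" or "handed_off"
    attempts: int
    patch: str


def run_bounded_task(
    propose_patch: Callable[[str, str], str],
    run_tests: Callable[[str], tuple[bool, str]],
    issue: str,
    max_attempts: int = 3,
) -> Outcome:
    """Work on one issue for a limited number of attempts, feeding test output back."""
    feedback = ""
    patch = ""
    for attempt in range(1, max_attempts + 1):
        patch = propose_patch(issue, feedback)  # hypothetical model call
        passed, feedback = run_tests(patch)     # fast feedback signal
        if passed:
            # Passing tests only earns a review; a human still gates the merge.
            return Outcome("ready_for_review", attempt, patch)
    # Out of attempts: hand off to a person instead of looping indefinitely.
    return Outcome("handed_off", max_attempts, patch)


# Stand-ins so the sketch runs without a model or a real test runner.
def fake_propose(issue: str, feedback: str) -> str:
    return "fixed" if feedback else "first try"


def fake_tests(patch: str) -> tuple[bool, str]:
    ok = patch == "fixed"
    return ok, "" if ok else "test_login failed"


result = run_bounded_task(fake_propose, fake_tests, "fix failing login test")
print(result.status, result.attempts)
```

The attempt cap is what keeps the scope bounded: the loop either produces something reviewable or stops and says so.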
Customer support augmentation
Not autonomous chatbots but agent-assisted support: the agent navigates the knowledge base and drafts responses, and the human reviews and sends. This is the same shape as code agents in development: the agent does the lookup and drafting; the human is the editor.
Sales and CRM workflows
Sales agents that prepare meeting briefings, summarise account histories, and draft follow-up emails. The agent reads from the CRM, calendars, and prior correspondence, then produces a synthesis. The salesperson reviews it and decides what to use.
Document workflow agents
The agent receives an incoming document (an email, a contract, a regulatory filing), extracts structured information, drafts a response or next action, and queues it for human approval. The agent does the structured work; the human approves.
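A minimal sketch of that draft-and-approve flow, assuming an in-memory queue and a trivial "key: value" parser standing in for model-based extraction. `Draft`, `ApprovalQueue`, and `agent_draft` are hypothetical names for illustration, not part of any named product.

```python
from dataclasses import dataclass, field


@dataclass
class Draft:
    document_id: str
    extracted: dict
    proposed_action: str
    status: str = "pending"  # pending -> approved / rejected


@dataclass
class ApprovalQueue:
    items: list = field(default_factory=list)
    sent: list = field(default_factory=list)

    def submit(self, draft: Draft) -> None:
        self.items.append(draft)

    def review(self, document_id: str, approve: bool) -> Draft:
        """Record a human decision; only approved actions are released."""
        for draft in self.items:
            if draft.document_id == document_id and draft.status == "pending":
                draft.status = "approved" if approve else "rejected"
                if approve:
                    self.sent.append(draft.proposed_action)
                return draft
        raise KeyError(f"no pending draft for {document_id}")


def agent_draft(document_id: str, text: str) -> Draft:
    # Stand-in for model-based extraction: read simple "key: value" lines.
    extracted = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            extracted[key.strip().lower()] = value.strip()
    sender = extracted.get("from", "unknown sender")
    return Draft(document_id, extracted, f"Reply to {sender} acknowledging receipt")


queue = ApprovalQueue()
queue.submit(agent_draft("doc-1", "From: Acme Ltd\nSubject: Renewal"))
sent_before_review = list(queue.sent)  # nothing leaves until a human approves
queue.review("doc-1", approve=True)
print(sent_before_review, queue.sent)
```

The point of the design is that the agent can only write to the queue; releasing an action is a separate step that belongs to the reviewer.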
IT operations augmentation
Agents that triage incoming tickets, gather context from monitoring systems, draft an initial response, and escalate where appropriate. Especially useful for level-1 IT support and similar high-volume diagnostic work.
The pattern across these: bounded scope, human checkpoints, rich feedback signals, and clear hand-off points.
Where agents are not working
The categories where deployment is still mostly aspirational:
Open-ended task agents
"Plan my product launch." "Build me a marketing campaign." The agent is given a goal and a wide scope; the loop continues until satisfaction. In practice these often fail to converge, produce shallow output, or consume too much budget to be economical.
The issue is not capability so much as feedback signal. Without rich feedback at each step, the agent has no way to know whether it is making progress.
Debating multi-agent collectives
Agents talking to each other to reach consensus, brainstorm, refine. The demos are entertaining; the production cases either produce one-shot solutions wrapped in unnecessary multi-agent overhead or produce conversations that don't terminate.
Long-horizon autonomous work
"Run my email" or "manage my calendar" with full autonomy. The error tolerance for autonomous personal-assistant work is very low; current capability does not meet it for most users.
Domain-specific reasoning without grounding
An agent asked to do specialised work — legal research, medical diagnosis, financial analysis — that requires deep domain knowledge often produces plausible-sounding output that experts can identify as wrong. Without strong grounding and human review, these are not production-grade.
Critical-action agents
Agents that take consequential actions autonomously — purchasing, transacting, communicating with customers. The risk profile of these is currently too high for autonomous operation. They work as drafting-and-approval pairs; they don't work as autonomous actors.
The pattern that distinguishes them
A consistent shape across working and non-working categories:
Working agents have:
- Bounded scope per request
- Rich feedback signals (tests pass/fail, validation succeeds/fails, downstream system accepts/rejects)
- A human checkpoint before consequential action
- A clear hand-off when the agent cannot proceed
Not-working agents have:
- Open-ended scope
- Weak or no feedback per step
- Pure autonomy on consequential outcomes
- No graceful failure mode
The distinction is not about the model; the same models work in one shape and fail in the other.
The bridge to broader adoption
What the not-yet-working categories need to become working:
Better feedback signals
For autonomous work in domains with no natural feedback (no tests, no validators), the feedback has to be built: the real work is instrumenting it. What constitutes a good output? Can it be measured? If not, the agent cannot improve through iteration.
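One way to make "can it be measured?" concrete is to write the definition of a good output down as named checks and score each output against them. The sketch below is illustrative only; the three checks are invented examples, and real checks would come from the workflow's own acceptance criteria.

```python
from typing import Callable

Check = Callable[[str], bool]


def score_output(output: str, checks: dict[str, Check]) -> tuple[float, list[str]]:
    """Return the fraction of checks passed and the names of the checks that failed."""
    if not checks:
        raise ValueError("at least one check is required")
    failed = [name for name, check in checks.items() if not check(output)]
    passed = len(checks) - len(failed)
    return passed / len(checks), failed


# Hypothetical checks for a drafted status update.
checks = {
    "mentions_deadline": lambda text: "deadline" in text.lower(),
    "under_50_words": lambda text: len(text.split()) <= 50,
    "has_next_step": lambda text: "next step" in text.lower(),
}

score, failed = score_output("The deadline is Friday.", checks)
print(round(score, 2), failed)
```

The list of failed check names is the useful part: it can be passed back to the agent as the feedback for the next iteration.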
Tooling for human supervision
The right interaction model isn't full autonomy or full manual; it's supervised autonomy where the agent works and the human reviews efficiently. The tooling for efficient review — surfacing what the agent did, why, with what alternatives — is largely unbuilt.
Domain-specific grounding
Off-the-shelf models reason from general training. Domain-specific reasoning needs domain-specific grounding — knowledge bases, structured data, examples. Without this, the agent's domain capability is shallow.
Bounded autonomy
The pattern that works isn't "agent does everything"; it is "agent does enumerated steps with human approval at key checkpoints." Bounding the autonomy at the right points is design work that needs to happen per workflow.
Risk-tiered deployment
Critical actions stay manual; routine actions automate; intermediate actions have human approval. Tiering the work by risk and applying agency at the right tier is the discipline that turns agents from research curiosities into production tools.
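As a rough sketch, the tiering can be expressed as a lookup from action to tier, with the tier deciding whether the agent executes, queues for approval, or stays out of the way. The action names and their tier assignments below are hypothetical; each team would draw these lines per workflow.

```python
from enum import Enum


class Tier(Enum):
    ROUTINE = "routine"
    INTERMEDIATE = "intermediate"
    CRITICAL = "critical"


# Hypothetical action catalogue for illustration.
ACTION_TIERS = {
    "tag_ticket": Tier.ROUTINE,
    "send_customer_email": Tier.INTERMEDIATE,
    "issue_refund": Tier.CRITICAL,
}


def route(action: str) -> str:
    """Decide how an agent-proposed action is handled, based on its risk tier."""
    # Unknown actions default to the most restrictive tier.
    tier = ACTION_TIERS.get(action, Tier.CRITICAL)
    if tier is Tier.ROUTINE:
        return "execute"
    if tier is Tier.INTERMEDIATE:
        return "queue_for_approval"
    return "manual_only"


for name in ("tag_ticket", "send_customer_email", "issue_refund", "delete_account"):
    print(name, "->", route(name))
```

Defaulting unlisted actions to the critical tier means a new capability has to be classified deliberately before the agent can act on it.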
The framework landscape in mid-2024
Briefly, the agent frameworks teams are picking up:
- LangGraph — graph-based orchestration, controllable, suited for production
- Hosted vendor APIs (e.g., the OpenAI Assistants API) — OpenAI's and Anthropic's hosted approaches, easier to start, less flexible
- CrewAI — opinionated multi-agent framework, demo-friendly
- Custom orchestration — what most production teams end up with
- Domain-specific frameworks — e.g., AI software engineering tools like Cursor, Devin (announced; capability still emerging)
The framework choice matters less than the design discipline. A team with discipline ships with any framework; a team without discipline struggles with the best framework.
What we keep seeing
Recurring patterns in enterprise agent deployments:
The bounded ones ship; the open-ended ones don't. This is the most reliable pattern. Bounded scope is the predictor.
Feedback infrastructure determines quality. Where the team has invested in evaluating agent outputs, the quality improves. Where it hasn't, quality drifts.
Human-supervised flows are the production pattern. Pure autonomy remains experimental for most consequential workloads. The supervised flow is where the value is captured.
Cost is real and underestimated. Agent loops produce more model calls than single-shot interactions. Without budgeting and circuit breakers, the bills are unpleasant.
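A circuit breaker for an agent loop can be as simple as a counter that is charged before every model call and raises once a cap is hit. This is a minimal sketch under assumed limits; `CallBudget`, the per-call cost of 0.02, and the caps are illustrative values, not measurements.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent run would exceed its call or cost cap."""


class CallBudget:
    def __init__(self, max_calls: int, max_cost_usd: float):
        self.max_calls = max_calls
        self.max_cost_usd = max_cost_usd
        self.calls = 0
        self.spent = 0.0

    def charge(self, cost_usd: float) -> None:
        """Reserve budget for one call, or trip the breaker before it is made."""
        if self.calls + 1 > self.max_calls:
            raise BudgetExceeded(f"call cap of {self.max_calls} reached")
        if self.spent + cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost cap of {self.max_cost_usd:.2f} USD reached")
        self.calls += 1
        self.spent += cost_usd


budget = CallBudget(max_calls=5, max_cost_usd=1.00)
completed = 0
stop_reason = ""
try:
    for step in range(100):   # a loop that would otherwise run away
        budget.charge(0.02)   # check the budget before each model call
        completed += 1
except BudgetExceeded as exc:
    stop_reason = str(exc)

print(completed, stop_reason)
```

Charging before the call rather than after means the run stops at the cap instead of one call past it.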
Adoption depends on integration. Agents that integrate with the team's existing tools (IDE, ticketing, CRM, email) get adopted. Agents that are separate surfaces don't.
What we recommend
For enterprise teams considering agent deployments in 2024:
- Start with bounded workloads. Open-ended is research; bounded is production.
- Identify the feedback signals before building the agent. Without them, the agent cannot improve.
- Design the human supervision interaction carefully. The efficient-review surface is the asset.
- Apply risk-tiered autonomy. Routine work autonomous; consequential work supervised; critical work manual.
- Budget the cost. Set circuit breakers. Agent loops can run away.
- Integrate with existing tools. Separate surfaces lose adoption.
- Measure against alternatives. Sometimes a non-agent solution is the right answer.
AI agents in 2024 are a real category in some shapes and an aspirational category in others. The teams that ship reliably are the ones that understand the difference and choose the bounded shape. The teams that chase the autonomous demo end up with systems that perform well in showcase environments and not in production. The capability will broaden over the coming years; the shape of what works will keep refining.