AI & Enterprise AI · 16 July 2024 · 8 min read

The Practical State of AI Agents in Mid-2024

The agent conversation has moved from hype to deployment in some categories and remains hype in others. A practitioner snapshot of where agents are actually working and where they are still demos.

By Intellectual AI Engineering Practice · Collective byline

The AI agent conversation in mid-2024 is louder than ever. Frameworks proliferate. Demos are striking. The case for "agents will change enterprise work" is made in every keynote. The picture in production is more nuanced: agents are working in some categories, are not working in others, and the dividing line is becoming clearer.

This is a practitioner snapshot of where AI agents are actually shipping in enterprise contexts in mid-2024 — what is working, what is not, and what the bridge to broader adoption actually requires.

Where agents are working

The categories where production agent deployments are showing real value:

Code agents in development workflows

The most successful agent category to date. The agent operates inside the development environment, with access to the codebase, tests, and execution. Tasks like "implement this feature based on this issue" or "fix this failing test" produce useful results often enough to be worth the cost.

What makes it work:

Bounded scope — the agent works on a specific issue or task, not on open-ended objectives
Fast feedback — tests run; outputs are validated; errors are immediate
Human review at the PR — the agent's output is reviewed before merge
Strong existing tooling — the development environment is rich with feedback signals

The result: agents that contribute usefully to enumerated tasks, with human review as the quality gate.
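The four properties above can be sketched as a small loop. This is a minimal illustration, not a description of any particular product: `propose_patch` and `run_tests` are hypothetical callables standing in for the model call and the test runner, and the attempt limit of 3 is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CheckResult:
    passed: bool
    log: str  # failure output fed back to the agent on the next attempt


@dataclass
class Outcome:
    status: str           # "ready_for_review" or "handed_off"
    patch: Optional[str]  # candidate change, present only if tests passed
    attempts: int


def run_bounded_task(
    issue: str,
    propose_patch: Callable[[str, str], str],  # hypothetical: (issue, feedback) -> patch
    run_tests: Callable[[str], CheckResult],   # hypothetical: patch -> test result
    max_attempts: int = 3,
) -> Outcome:
    """Iterate on one issue until tests pass or the attempt budget runs out."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        patch = propose_patch(issue, feedback)
        result = run_tests(patch)
        if result.passed:
            # Passing tests is not approval: the patch still goes to PR review.
            return Outcome("ready_for_review", patch, attempt)
        feedback = result.log
    # Out of attempts: stop and hand the task back to a human.
    return Outcome("handed_off", None, max_attempts)
```

The loop never merges anything itself. Its best outcome is a candidate for human review, and its failure mode is an explicit hand-off rather than another retry.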

Customer support augmentation

These are not autonomous chatbots. This is agent-assisted support: the agent navigates the knowledge base and drafts responses, and the human reviews and sends. It has the same shape as code agents in development: the agent does the lookup and drafting; the human is the editor.

Sales and CRM workflows

Sales agents that prepare meeting briefings, summarise account histories, draft follow-up emails. The agent reads from CRM, calendars, prior correspondence; produces a synthesis. The salesperson reviews and uses.

Document workflow agents

The agent receives an incoming document (an email, a contract, a regulatory filing), extracts structured information, drafts a response or next action, and queues it for human approval. The agent does the structured work; the human approves.

IT operations augmentation

Agents that triage incoming tickets, gather context from monitoring systems, draft an initial response, and escalate where appropriate. They are especially useful for level-1 IT support and similar high-volume diagnostic work.

The pattern across these: bounded scope, human checkpoints, rich feedback signals, and clear hand-off points.

Where agents are not working

The categories where deployment is still mostly aspirational:

Open-ended task agents

"Plan my product launch." "Build me a marketing campaign." The agent is given a goal and a wide scope, and the loop runs until the agent judges the goal met. In practice these often fail to converge, produce shallow output, or consume too much budget to be economical.

The issue is not capability so much as feedback signal. Without rich feedback at each step, the agent has no way to know whether it is making progress.

Multi-agent collectives debating

Agents talking to each other to reach consensus, brainstorm, refine. The demos are entertaining; the production cases either produce one-shot solutions wrapped in unnecessary multi-agent overhead or produce conversations that don't terminate.

Long-horizon autonomous work

"Run my email" or "manage my calendar" with full autonomy. The error tolerance for autonomous personal-assistant work is very low; current capability does not meet it for most users.

Domain-specific reasoning without grounding

An agent asked to do specialised work — legal research, medical diagnosis, financial analysis — that requires deep domain knowledge often produces plausible-sounding output that experts can identify as wrong. Without strong grounding and human review, these are not production-grade.

Critical-action agents

Agents that take consequential actions autonomously — purchasing, transacting, communicating with customers. The risk profile of these is currently too high for autonomous operation. They work as drafting-and-approval pairs; they don't work as autonomous actors.

The pattern that distinguishes them

A consistent shape across working and non-working categories:

Working agents have:

Bounded scope per request
Rich feedback signals (tests pass/fail, validation succeeds/fails, downstream system accepts/rejects)
A human checkpoint before consequential action
A clear hand-off when the agent cannot proceed

Not-working agents have:

Open-ended scope
Weak or no feedback per step
Pure autonomy on consequential outcomes
No graceful failure mode

The distinction is not about the model; the same models work in one shape and fail in the other.

The bridge to broader adoption

What the not-yet-working categories need to become working:

Better feedback signals

For autonomous work in domains with no natural feedback (no tests, no validators), the work lies in instrumenting feedback. What constitutes a good output? Can it be measured? If not, the agent cannot improve through iteration.
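One way to start is to write down "good output" as a set of named, machine-checkable criteria. The sketch below assumes a drafted customer reply as the output; the check names and thresholds are illustrative assumptions, and real criteria would be specific to the domain.

```python
from typing import Callable, Dict

Check = Callable[[str], bool]


def score_output(output: str, checks: Dict[str, Check]) -> Dict[str, object]:
    """Run every named check and report which ones failed plus a pass rate."""
    results = {name: bool(check(output)) for name, check in checks.items()}
    failed = [name for name, ok in results.items() if not ok]
    pass_rate = (len(results) - len(failed)) / len(results) if results else 0.0
    return {"pass_rate": pass_rate, "failed": failed}


# Illustrative checks for a drafted reply (assumed, not from any real system).
reply_checks: Dict[str, Check] = {
    "mentions_ticket_id": lambda text: "TICKET-" in text,
    "under_120_words": lambda text: len(text.split()) <= 120,
    "no_unfilled_placeholder": lambda text: "[TODO]" not in text,
}

report = score_output(
    "Re TICKET-481: the fix ships Monday. [TODO] add link", reply_checks
)
print(report["failed"])  # names of the checks the draft did not satisfy
```

The list of failed checks is the feedback signal: it can be fed back to the agent for another attempt, or tracked over time to see whether quality is drifting.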

Tooling for human supervision

The right interaction model isn't full autonomy or full manual; it's supervised autonomy where the agent works and the human reviews efficiently. The tooling for efficient review — surfacing what the agent did, why, with what alternatives — is largely unbuilt.

Domain-specific grounding

Off-the-shelf models reason from general training. Domain-specific reasoning needs domain-specific grounding — knowledge bases, structured data, examples. Without this, the agent's domain capability is shallow.

Bounded autonomy

The pattern that works isn't "agent does everything"; it is "agent does enumerated steps with human approval at key checkpoints." Bounding the autonomy at the right points is design work that needs to happen per workflow.
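As a rough sketch of that shape, a workflow can be an explicit list of steps, some of which are marked as checkpoints. The step names, the `approve` callback, and the lambdas standing in for agent calls are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], str]     # does the work and returns a short summary
    needs_approval: bool = False  # True marks a human checkpoint before this step


def run_workflow(steps: List[Step], approve: Callable[[str], bool]) -> List[str]:
    """Run steps in order; stop at the first checkpoint a human declines."""
    log: List[str] = []
    for step in steps:
        if step.needs_approval and not approve(step.name):
            log.append(f"stopped before {step.name}: approval declined")
            break
        log.append(f"{step.name}: {step.action()}")
    return log


# Hypothetical contract-reply workflow; only the send step is a checkpoint.
steps = [
    Step("extract_terms", lambda: "12 clauses extracted"),
    Step("draft_reply", lambda: "draft ready"),
    Step("send_reply", lambda: "sent", needs_approval=True),
]
log = run_workflow(steps, approve=lambda name: False)  # reviewer declines
```

Which steps carry `needs_approval` is the per-workflow design decision; the runner itself stays trivial.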

Risk-tiered deployment

Critical actions stay manual; routine actions automate; intermediate actions have human approval. Tiering the work by risk and applying agency at the right tier is the discipline that turns agents from research curiosities into production tools.
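The tiering can be as simple as a lookup table that is consulted before any action runs. The action names and their tier assignments below are assumptions for illustration; the one deliberate design choice is that an unlisted action falls into the most restrictive tier.

```python
from enum import Enum


class Tier(Enum):
    ROUTINE = "automate"
    INTERMEDIATE = "human_approval"
    CRITICAL = "manual"


# Hypothetical actions and tier assignments; a real table comes from a risk review.
ACTION_TIERS = {
    "tag_ticket": Tier.ROUTINE,
    "send_customer_email": Tier.INTERMEDIATE,
    "issue_refund": Tier.CRITICAL,
}


def route(action: str) -> Tier:
    """Return the tier for an action; unknown actions default to CRITICAL."""
    return ACTION_TIERS.get(action, Tier.CRITICAL)


for name in ("tag_ticket", "send_customer_email", "delete_account"):
    print(name, "->", route(name).value)
```

Defaulting unknown actions to manual handling means a newly added tool cannot become autonomous by omission.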

The framework landscape in mid-2024

Briefly, the agent frameworks teams are picking up:

LangGraph: graph-based orchestration, controllable, suited for production
Hosted vendor APIs (e.g., OpenAI Assistants API): Anthropic's and OpenAI's hosted approaches, easier to start, less flexible
CrewAI: opinionated multi-agent framework, demo-friendly
Custom orchestration: what most production teams end up with
Domain-specific frameworks: e.g., AI software engineering tools like Cursor, Devin (announced; capability still emerging)

The framework choice matters less than the design discipline. A team with discipline ships with any framework; a team without discipline struggles with the best framework.

What we keep seeing

Recurring patterns in enterprise agent deployments:

The bounded ones ship; the open-ended ones don't. This is the most reliable pattern. Bounded scope is the predictor.

Feedback infrastructure determines quality. Where the team has invested in evaluating agent outputs, the quality improves. Where it hasn't, quality drifts.

Human-supervised flows are the production pattern. Pure autonomy remains experimental for most consequential workloads. The supervised flow is where the value is captured.

Cost is real and underestimated. Agent loops produce more model calls than single-shot interactions. Without budgeting and circuit breakers, the bills are unpleasant.

Adoption depends on integration. Agents that integrate with the team's existing tools (IDE, ticketing, CRM, email) get adopted. Agents that are separate surfaces don't.
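On the cost point above, a circuit breaker can be a small counter that every model call passes through. This is a sketch under stated assumptions: the per-call cost is supplied by the caller, and the limits shown are arbitrary example values.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent run goes past its call or cost limit."""


class CostBreaker:
    def __init__(self, max_calls: int, max_cost_usd: float) -> None:
        self.max_calls = max_calls
        self.max_cost_usd = max_cost_usd
        self.calls = 0
        self.cost_usd = 0.0

    def record(self, cost_usd: float) -> None:
        """Record one model call; raise once either limit is exceeded."""
        self.calls += 1
        self.cost_usd += cost_usd
        if self.calls > self.max_calls or self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(
                f"stopped after {self.calls} calls, ${self.cost_usd:.2f} spent"
            )


# Example limits (assumed): at most 20 calls or 1.00 USD per agent run.
breaker = CostBreaker(max_calls=20, max_cost_usd=1.00)
tripped = False
try:
    for _ in range(50):       # stands in for a loop that fails to converge
        breaker.record(0.08)  # assumed cost of one model call
except BudgetExceeded:
    tripped = True            # hand off to a human instead of continuing
```

Treating the exception as a hand-off point, rather than an error to retry, keeps a runaway loop from turning into a runaway bill.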

What we recommend

For enterprise teams considering agent deployments in 2024:

Start with bounded workloads. Open-ended is research; bounded is production.
Identify the feedback signals before building the agent. Without them, the agent cannot improve.
Design the human supervision interaction carefully. The efficient-review surface is the asset.
Apply risk-tiered autonomy. Routine work autonomous; consequential work supervised; critical work manual.
Budget the cost. Set circuit breakers. Agent loops can run away.
Integrate with existing tools. Separate surfaces lose adoption.
Measure against alternatives. Sometimes a non-agent solution is the right answer.

AI agents in 2024 are a real category in some shapes and an aspirational category in others. The teams that ship reliably are the ones that understand the difference and choose the bounded shape. The teams that chase the autonomous demo end up with systems that perform well in showcase environments and not in production. The capability will broaden over the coming years; the shape of what works will keep refining.
