Function Calling — Production Patterns for Enterprise
Function calling turned LLMs from text producers into action takers. The production patterns are constrained: a tight function catalogue, careful permission modelling, robust argument validation, and explicit human checkpoints for irreversible actions.
When function calling matured in the second half of 2023, the LLM stopped being a text producer and became, to a real extent, an action taker. A user asks a question; the model decides which function to call; the application executes the call; the result feeds back to the model, which produces the user-facing response. The pattern is now a year old in production, and the discipline that makes it work in enterprise environments is clearer than it was.
This piece is a practitioner view of production function-calling patterns — what works, what fails, and where the design effort actually pays off.
What function calling is
The model is given a catalogue of functions, each described with a name, a purpose, and a parameter schema. The user makes a request. The model decides whether to respond directly or to call a function. If it calls a function, the application executes the call, returns the result to the model, and the model either produces a user-facing response or calls another function.
The pattern unlocks workloads that previously required either deterministic code or human routing. "Schedule the meeting with the team after lunch," "look up the customer's order history and tell me about the recent returns," "draft a permit application from this email."
The pattern is also where the most expensive enterprise AI mistakes happen — agents taking actions they shouldn't, calls with wrong arguments, irreversible operations that should have required human confirmation.
The function catalogue as governed surface
The first principle: the function catalogue is a governed surface. Each function is an action the LLM can take. Adding a function is a security and operational decision, not a casual extension.
What this means in practice:
- Function approval process. A new function goes through review. What does it do, who can invoke it, what are the side effects, what is the blast radius.
- Function registry. A central catalogue with descriptions, owners, permission models, audit log requirements. The registry is the source of truth for what the LLM can do.
- Versioning. Functions evolve. The schema changes, the behaviour changes. The versions are tracked; LLM-callable versions are pinned.
- Deprecation policy. Removing or changing a function has consequences for any prompt that relies on it. Deprecation follows the same pattern as API deprecation.
A team that treats function additions casually accumulates exposure they will regret. A team that treats the catalogue as a governed surface scales the capability without compounding risk.
Schema as prompt
The function schema is not metadata; it is part of the prompt. The model uses the schema to decide which function to call and what arguments to use.
This means schema design is prompt engineering:
- Function names that describe purpose. A function called
lookupis ambiguous;lookup_customer_by_emailis not. - Parameter names that describe content. A parameter called
idis ambiguous;customer_idis not. - Parameter descriptions in the schema. Most schema formats allow descriptions; use them. "The customer's ID, as a string starting with 'CUST-' followed by digits" beats no description.
- Required vs optional parameters. Mark them correctly. Missing required parameters cause failures; unnecessary required parameters cause the model to omit calls it should make.
- Enum constraints where appropriate. A parameter that should be one of "approved", "denied", "pending" should have those as enum values, not free-form strings.
Schema quality has a measurable effect on call quality. Investment here pays off across every prompt that uses the functions.
Permission propagation
When the LLM calls a function on a user's behalf, the function executes with that user's permissions, not with the application's permissions.
This sounds obvious. In practice it requires explicit design:
- The user identity is captured at the start of the session.
- Every function call carries the user identity.
- The function execution layer enforces the user's permissions before executing the call.
- Audit logs record the user identity, the function call, the arguments, the result.
A common anti-pattern: the application is granted broad permissions; the LLM can call any function on any data, regardless of who the user is. This collapses access control. The application becomes a bypass for the access-control model.
The discipline is to propagate the user's identity to every function call and enforce permissions at the function-execution layer. This requires identity infrastructure that may not be present in early prototypes; building it in is necessary for production.
Argument validation
The model produces function arguments. The arguments need to be validated before the function executes.
What validation includes:
- Schema conformance. The arguments conform to the function's schema.
- Allowlists. If a parameter takes an identifier, the identifier exists. If a parameter takes a date, the date is in a valid range.
- Cross-parameter constraints. Start date before end date, total equals sum of items, the parameters together form a coherent request.
- Business rules. The combination of arguments doesn't violate domain constraints (no negative amounts, no future dates for past operations).
Validation runs in deterministic code, not in the LLM. The LLM proposes; the validator disposes. A failed validation does not silently retry or invent corrected arguments; it returns a clear error to the model, which can ask the user for clarification or fail gracefully.
Idempotency and irreversibility
A function exposed to an LLM should be idempotent where possible. The model may retry; calls may duplicate; idempotency makes this safe.
For functions that are not idempotent, the discipline is sharper:
- Read functions — no irreversibility concerns. Free to call.
- Reversible write functions — fine to call. Audit thoroughly.
- Irreversible functions — require explicit human confirmation. The LLM does not invoke them directly; it proposes the action; a human approves.
What counts as irreversible:
- Sending external communications (emails, messages, notifications)
- Financial transactions
- Deletions
- Operations on regulated workflows (filing, approving, denying)
- External API calls with side effects
The model can prepare the action; the human commits it. Without this discipline, the first agent error produces an external consequence that cannot be undone.
Loop discipline
A function-calling system often runs in a loop — the model calls a function, sees the result, decides what to do next, may call another function, and so on. The loop must be bounded.
- Step limit. A maximum number of function calls per user request. Reaching the limit triggers a graceful failure, not an infinite loop.
- Cost limit. A maximum cost per request. Reaching the limit triggers a graceful failure.
- Wall-clock limit. A maximum time per request. For long-running operations, the request becomes a background job rather than a blocking interaction.
- Sanity checks. If the model is making repeated calls to the same function with similar arguments, something is wrong. Break the loop.
The limits are circuit breakers, not aspirations. The cost of an unbounded loop is unbounded.
Observability
Every function call is a small distributed transaction. The trace has to be complete enough to reconstruct what happened:
- The user's request
- The LLM's reasoning (where available)
- The function chosen
- The arguments
- The validation result
- The function's response
- Any errors
- The model's next action
This is more than conventional API logging. The LLM's reasoning, the prompt and response at each turn, the full trace of function calls — all are part of the audit record. Without them, debugging is impossible and post-incident analysis is incomplete.
What we keep seeing
Recurring patterns in production function-calling deployments:
Function catalogues that grew faster than the governance. Six months of casual additions later, the team has eighty functions and no real model of what the LLM can do. Catalogue triage and consolidation become necessary.
Schema quality determining call quality. Teams notice that improving descriptions in the function schemas improves call accuracy substantially. Schema engineering is real engineering.
Permission gaps. Functions added with the application's permissions rather than the user's. Discovered when a user asks for data they shouldn't have access to and gets it.
Loop runaway. Without strict limits, a poorly-formed request produces dozens of function calls. Cost surprises. Limits get added.
Confusion between exposure and capability. Just because a function exists doesn't mean every LLM should be able to call it. Different agents may need different catalogues based on their role and the user they serve.
What we recommend
For an enterprise team adding function calling to LLM systems in 2024:
- Treat the function catalogue as a governed surface. Approval, registry, versioning, deprecation.
- Design schemas as prompt artifacts. Names, descriptions, enums — all matter.
- Propagate user identity to every call. Enforce permissions at execution.
- Validate arguments deterministically before executing.
- Distinguish read, reversible-write, and irreversible functions. Human checkpoint for irreversible.
- Bound the loop. Steps, cost, time. Enforce as circuit breakers.
- Capture full observability traces from day one.
Function calling is one of the most useful primitives in current LLM systems. It is also where the most significant production risks live. The discipline that keeps it safe is not complicated, but it has to be applied consistently. The teams that ship reliable function-calling systems are the ones that respect this from the start.
RELATED READING
More from the field.
Service practices the article draws on, related programmes, and other pieces on adjacent topics.
Service practices
Related pieces
The Practical State of AI Agents in Mid-2024
The agent conversation has moved from hype to deployment in some categories and remains hype in others. A practitioner snapshot of where agents are actually working and where they are still demos.
Agent Infrastructure Catches Up — The Production Stack in 2025
Agent infrastructure was the gap a year ago. In 2025 the stack has matured enough that production deployment is a reasonable expectation, not a research bet.
From AI Pilot to Production — The Playbook That Bridges the Gap
Every enterprise has AI pilots. Far fewer have AI in production. The bridge between the two is more about organisational discipline than technical capability. A practitioner playbook.
Discuss this work
Bring an enterprise programme.
If anything in this piece resonates with what you're building, talk to us. Senior practitioners engage directly on architecture and delivery.
Work with the practitioners
Bring an enterprise programme.
Architecture audit, new delivery, modernisation, or in-flight rescue — Intellectual engages directly on enterprise programmes with senior practitioners.