Intellectual
AI & Enterprise AI | 6 February 2024 | 9 min read

Intelligent Document Processing — From OCR to Understanding

Intelligent document processing has changed shape in the last eighteen months. A practitioner view of where the real work sits when LLMs join the pipeline — and why parsing still matters more than the model.

Intelligent document processing — the discipline of getting structured information out of unstructured or semi-structured documents — has changed shape in the last eighteen months. The OCR step is still there; the form-recognition step is still there; the validation and human-review steps are still there. What is new is that an LLM now sits in the pipeline as one of the most flexible reasoning steps in the chain.

This piece is a practitioner view of what IDP looks like in 2024 when designed deliberately, and where the work that actually determines quality sits.

The shape of a modern IDP pipeline

A working pipeline for processing, say, a regulatory permit application packet looks like:

  1. Document intake — files arrive, get queued, get classified by type
  2. OCR and layout parsing — text and structure extracted, with positional information
  3. Page-level classification — what each page is (cover letter, form, attachment, photo evidence)
  4. Form recognition — for known forms, fields are located and read
  5. LLM-assisted extraction — for unstructured or variable content, the LLM extracts structured fields
  6. Validation — extracted data checked against business rules
  7. Human review — for cases that fail confidence thresholds or business rules
  8. Persistence and integration — clean structured data flows into the system of record
  9. Audit trail — every step recorded with the inputs, the outputs, the confidence scores

Each step matters. The LLM is the most flexible step but not the most important. The OCR and layout parsing step quietly determines whether the rest of the pipeline produces clean output or garbage.
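
One way to picture the pipeline above is as an ordered list of stage functions, each adding to a shared document record and leaving an audit entry behind. The sketch below is a minimal illustration under stated assumptions, not a reference design: `Document`, `run_pipeline`, and the three stand-in stages are hypothetical names, and real stages would call a parser, a form recogniser, or an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Document:
    doc_id: str
    payload: dict = field(default_factory=dict)  # data accumulated by the stages
    audit: list = field(default_factory=list)    # one entry per completed stage
    needs_review: bool = False


Stage = Callable[[Document], Document]


def run_pipeline(doc: Document, stages: list[tuple[str, Stage]]) -> Document:
    """Run the stages in order and record an audit entry after each one."""
    for name, stage in stages:
        doc = stage(doc)
        doc.audit.append({"stage": name, "keys": sorted(doc.payload)})
    return doc


# Hypothetical stand-in stages, for illustration only.
def parse(doc: Document) -> Document:
    doc.payload["text"] = "Permit no. 123"
    return doc


def extract(doc: Document) -> Document:
    doc.payload["permit_no"] = doc.payload["text"].split()[-1]
    return doc


def validate(doc: Document) -> Document:
    # Business rule (assumed): permit numbers are purely numeric.
    doc.needs_review = not doc.payload["permit_no"].isdigit()
    return doc


result = run_pipeline(
    Document("doc-1"),
    [("parse", parse), ("extract", extract), ("validate", validate)],
)
```

Keeping the audit write inside `run_pipeline` rather than inside each stage means a new stage cannot forget to record itself.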

What changed when LLMs arrived

Pre-LLM IDP pipelines did extraction in two ways:

  • Template-based extraction for known forms — fast, accurate, brittle when forms change
  • Model-based extraction for unknown forms — slower, less accurate, required training data per form type

The LLM offers a third path: extraction by description rather than by template. You describe the fields you want; the LLM finds them across documents of varying layouts. This works well enough that the calculus of when to build a template versus when to use general-purpose extraction has shifted decisively.

What it does not change:

  • OCR quality is still the foundation. A bad OCR pass produces text the LLM cannot reliably interpret. Investment in the parsing layer matters more, not less.
  • Validation discipline is still required. The LLM produces plausible-looking output. Plausible is not correct. Business-rule validation on extracted fields is non-negotiable.
  • Human review is still the production bar. No production IDP system runs without escalation paths for low-confidence cases.

OCR and layout parsing — still where it starts

The parsing layer has matured significantly. The options now:

  • Cloud APIs — Azure Document Intelligence (formerly Form Recognizer), AWS Textract, Google Document AI. All three handle modern documents well, including tables and forms.
  • Specialised commercial products — ABBYY, Hyperscience, others. Strongest where layout complexity is high.
  • Open-source parsers — pdfplumber, PyMuPDF, Tesseract for OCR, Unstructured for end-to-end layout-aware parsing. Closing the gap on cloud APIs for many document types.

The choice depends on document profile:

  • Born-digital PDFs — open-source parsers are often sufficient
  • Scanned documents — cloud APIs or commercial products produce more reliable output
  • Forms with structured fields — Azure Document Intelligence or Textract have built-in form recognition that is hard to beat
  • Complex layouts (multi-column, sidebars, tables spanning pages) — commercial products tend to handle these better

What we keep seeing: teams that adopt the LLM enthusiastically while continuing to use a weak parser. The LLM does its best with garbled input, but the ceiling is set by what the parser produces. Investing in the parsing layer often improves quality more than switching to a more capable model.
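
The routing by document profile described above can be sketched as a small decision function. This is an illustration only: the 50-character and 80% cut-offs, the parser labels, and the function names are assumptions, and `page_texts` stands for whatever text layer a cheap first pass (for example an open-source PDF parser) returns per page.

```python
def looks_born_digital(page_texts: list[str], min_chars_per_page: int = 50) -> bool:
    """Heuristic: a born-digital PDF has a usable text layer on most pages."""
    if not page_texts:
        return False
    pages_with_text = sum(
        1 for text in page_texts if len(text.strip()) >= min_chars_per_page
    )
    return pages_with_text / len(page_texts) >= 0.8


def choose_parser(
    page_texts: list[str],
    has_known_form: bool = False,
    complex_layout: bool = False,
) -> str:
    """Pick a parser family following the document-profile guidance above."""
    if complex_layout:
        return "commercial"              # multi-column, sidebars, spanning tables
    if has_known_form:
        return "cloud_form_recognition"  # built-in form recognition
    if looks_born_digital(page_texts):
        return "open_source"             # text layer present, no OCR needed
    return "cloud_ocr"                   # scanned pages
```

The order of the checks is a design choice: layout complexity is tested first because it is the case where a weaker parser does the most damage.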

Tables and complex layouts

Tables are the hardest common case. A document with a financial statement table, where rows have different indentation levels, columns have spanning headers, and a totals row runs across the bottom, is non-trivial to extract correctly.

What works in practice:

  • Use layout-aware parsing. A parser that produces bounding boxes and table structure produces output the LLM can interpret. Plain text produces ambiguity.
  • Process tables separately. Identify tables in the layout step, extract them as structured data, then feed the structured data to the LLM. Don't ask the LLM to interpret raw text that contains tables; the geometry information is gone.
  • Preserve column headers. A row in a table only makes sense in context of its column headers. The extraction needs to keep the relationship.

Where tables matter heavily — financial documents, regulatory submissions, lab reports — invest in the table extraction step. The LLM is downstream of it; it cannot recover what the parser destroyed.
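
To make the "preserve column headers" point concrete, the sketch below turns an already-extracted table into one record per row, so that every cell carries its header when it reaches the LLM. It assumes the layout step has produced a header list and equal-length rows; `table_to_records` and `records_to_prompt_block` are hypothetical helper names.

```python
import json


def table_to_records(header: list[str], rows: list[list[str]]) -> list[dict[str, str]]:
    """Pair every cell with its column header so each row stands on its own."""
    records = []
    for row in rows:
        if len(row) != len(header):
            # A ragged row usually means the parser lost table structure;
            # fail loudly instead of silently misaligning columns.
            raise ValueError(f"row has {len(row)} cells, header has {len(header)}")
        records.append(dict(zip(header, row)))
    return records


def records_to_prompt_block(records: list[dict[str, str]]) -> str:
    """Serialise the rows as one JSON object per line for inclusion in a prompt."""
    return "\n".join(json.dumps(record, ensure_ascii=False) for record in records)


records = table_to_records(
    ["Item", "2023", "2022"],
    [["Revenue", "120", "100"], ["Total", "150", "130"]],
)
prompt_block = records_to_prompt_block(records)
```

Spanning headers and indented sub-rows need more than this (for example, flattening "2023 / Q1" into a single header string), but the principle is the same: resolve the geometry before the LLM sees the data.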

The LLM extraction step

The pattern that works:

  1. Define the extraction schema explicitly. A pydantic model, a JSON schema, a structured prompt that names every field.
  2. Pass the relevant content with layout context. Either the layout-preserving text, or specific structured fragments (this table, this section), or page images for multimodal models where appropriate.
  3. Use function calling or structured-output APIs. Don't free-form the LLM; constrain it to the schema.
  4. Capture confidence. Some models produce per-field confidence; otherwise, ask the model to flag uncertain fields. Use this for downstream routing.
  5. Validate output against the schema. If the LLM produces invalid output, retry with a corrective prompt before failing.
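
A minimal sketch of steps 1, 3, and 5 follows. To stay dependency-free it swaps in a plain dictionary schema and standard-library JSON checks where a production system would more likely use a pydantic model and a provider's structured-output API. `PERMIT_SCHEMA`, its fields, and `call_llm` (any function that takes a prompt string and returns the model's text) are assumptions for illustration.

```python
import json
from typing import Callable

# Hypothetical schema: field name -> (accepted types, description for the prompt).
PERMIT_SCHEMA = {
    "applicant_name": (str, "Legal name of the applicant as written on the form"),
    "permit_code": (str, "Regulatory permit code exactly as printed"),
    "requested_amount": ((int, float), "Fee amount as a number, without currency symbol"),
}


def schema_errors(data: object, schema: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the output is valid."""
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    errors = []
    for name, (expected_type, _description) in schema.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            errors.append(f"wrong type for field: {name}")
    return errors


def extract_with_retry(
    text: str,
    schema: dict,
    call_llm: Callable[[str], str],
    max_attempts: int = 2,
) -> dict:
    """Ask for schema-shaped JSON; on invalid output, retry with a corrective prompt."""
    field_lines = "\n".join(f"- {name}: {desc}" for name, (_t, desc) in schema.items())
    prompt = (
        "Return a JSON object with exactly these fields:\n"
        f"{field_lines}\n\nDocument:\n{text}"
    )
    errors: list[str] = []
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            data = json.loads(raw)
            errors = schema_errors(data, schema)
        except json.JSONDecodeError as exc:
            data, errors = None, [f"invalid JSON: {exc.msg}"]
        if not errors:
            return data
        prompt += (
            "\n\nYour previous answer was rejected: "
            + "; ".join(errors)
            + ". Return corrected JSON only."
        )
    raise ValueError("extraction failed schema validation: " + "; ".join(errors))
```

Note that the field descriptions travel into the prompt: the schema doubles as the instruction, which is why vague field names hurt.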

What teams underestimate:

  • The schema is part of the prompt. Vague field names produce vague extractions. The schema should make it unambiguous what each field is.
  • Few-shot examples carry weight. Showing the model two or three correct extractions improves quality substantially on a new document type. The examples are part of the cost of deployment.
  • Long documents need decomposition. A 100-page document doesn't fit in a useful prompt. Process by section, then assemble.
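
The "process by section, then assemble" step can be sketched as two helpers. This version groups consecutive pages by a character budget, which is an assumption; splitting on detected section boundaries is usually better when the layout step provides them. The function names and the 8,000-character default are hypothetical.

```python
def chunk_pages(pages: list[str], max_chars: int = 8000) -> list[list[int]]:
    """Group consecutive page indices so each group stays under max_chars.

    A single page longer than max_chars still gets its own group.
    """
    chunks: list[list[int]] = []
    current: list[int] = []
    size = 0
    for index, page in enumerate(pages):
        if current and size + len(page) > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(index)
        size += len(page)
    if current:
        chunks.append(current)
    return chunks


def merge_extractions(partials: list[dict]) -> dict:
    """Keep the first non-empty value per field and record disagreements."""
    merged: dict = {}
    conflicts: dict = {}
    for part in partials:
        for key, value in part.items():
            if value in (None, ""):
                continue
            if key not in merged:
                merged[key] = value
            elif merged[key] != value:
                conflicts.setdefault(key, [merged[key]]).append(value)
    return {"fields": merged, "conflicts": conflicts}
```

Surfacing conflicts instead of silently picking a winner gives the validation and review steps something to act on when two sections disagree.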

Validation — where the rubber meets the road

Extraction confidence does not equal correctness. A model can be confidently wrong. Validation against business rules is what catches this.

Examples:

  • An extracted date should be in a valid range.
  • An extracted monetary amount should match the totals on the document.
  • An extracted entity name should exist in the customer master.
  • An extracted regulatory code should be a valid code in the current code list.
  • Cross-field consistency — start date before end date, total equals sum of line items.

When validation fails, the case routes to human review. Without validation, errors propagate into the system of record and surface weeks later as data quality incidents.
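
The example rules above translate almost directly into code. The sketch below assumes an extraction record with `start_date`, `end_date`, `line_items`, `total`, `customer`, and `code` keys; those names, the default date range, and the lookup sets are hypothetical stand-ins for a real customer master and code list.

```python
from datetime import date
from decimal import Decimal
from typing import Optional


def validate_extraction(
    rec: dict,
    valid_codes: set[str],
    known_customers: set[str],
    earliest: date = date(2000, 1, 1),
    latest: Optional[date] = None,
) -> list[str]:
    """Return the list of failed business rules; empty means the record passes."""
    latest = latest or date.today()
    failures = []
    if not (earliest <= rec["start_date"] <= latest):
        failures.append("start_date out of range")
    if rec["start_date"] >= rec["end_date"]:
        failures.append("start_date not before end_date")
    # Decimal avoids float rounding noise when comparing monetary sums.
    if sum(rec["line_items"], Decimal("0")) != rec["total"]:
        failures.append("total does not equal sum of line items")
    if rec["customer"] not in known_customers:
        failures.append("unknown customer")
    if rec["code"] not in valid_codes:
        failures.append("unknown regulatory code")
    return failures
```

Returning every failed rule, not just the first, lets the review screen show the reviewer exactly what to look at.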

Human review

Production IDP systems run with humans in the loop. The design problem is making review efficient.

  • Confidence-tiered routing. High-confidence extractions go straight through, mid-confidence extractions go to review, and low-confidence extractions go to specialist review.
  • Pre-populated review screens. The reviewer sees the document and the extraction side by side; they correct what is wrong rather than typing from scratch.
  • Feedback loop. Corrections feed back into the system as training signal for fine-tuned models, as eval-set additions, or as prompt refinement input. The feedback loop is the difference between a system that improves and one that stagnates.
  • Audit trail. Every reviewer action is recorded, both for compliance and for understanding where the system is making mistakes.
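
Confidence-tiered routing can be sketched in a few lines. The thresholds here (0.95 and 0.70) are placeholders, not recommendations; the sketch also assumes routing on the lowest field confidence in the document and that any validation failure blocks straight-through processing.

```python
def route(
    field_confidences: dict[str, float],
    validation_failures: list[str],
    auto_threshold: float = 0.95,
    review_threshold: float = 0.70,
) -> str:
    """Return 'straight_through', 'review', or 'specialist_review'."""
    # The weakest field decides the tier; no confidences at all counts as lowest.
    lowest = min(field_confidences.values(), default=0.0)
    if lowest >= auto_threshold:
        tier = "straight_through"
    elif lowest >= review_threshold:
        tier = "review"
    else:
        tier = "specialist_review"
    # Confidence is not correctness: a failed business rule always needs a human.
    if validation_failures and tier == "straight_through":
        tier = "review"
    return tier
```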

Teams that design the review experience as a first-class part of the system see throughput double over teams that treat review as a fallback.

Audit and lineage

Regulated IDP systems have to defend their extractions. That means recording, for every field in every document:

  • The source page and position the extraction came from
  • The OCR text the extraction was based on
  • The model version that produced the extraction
  • The confidence and any flags
  • Any human corrections, with reviewer identity and timestamp

This is a non-trivial amount of metadata to capture and store. It is also the difference between a system that can be audited and one that cannot. Capture it from day one.
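
As a rough illustration of what that per-field record might look like, the sketch below models the five items in the list as a dataclass. All class and field names are hypothetical, and a real system would persist these records in an append-only store, not in memory.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Correction:
    reviewer_id: str
    old_value: str
    new_value: str
    corrected_at: str  # ISO 8601 timestamp


@dataclass
class FieldLineage:
    document_id: str
    field_name: str
    value: str
    page: int                                  # source page
    bbox: tuple[float, float, float, float]    # position on the page
    ocr_text: str                              # text the extraction was based on
    model_version: str
    confidence: float
    flags: list[str] = field(default_factory=list)
    corrections: list[Correction] = field(default_factory=list)

    def correct(self, reviewer_id: str, new_value: str,
                when: Optional[datetime] = None) -> None:
        """Apply a human correction while keeping the previous value on record."""
        when = when or datetime.now(timezone.utc)
        self.corrections.append(
            Correction(reviewer_id, self.value, new_value, when.isoformat())
        )
        self.value = new_value


record = FieldLineage(
    document_id="doc-1", field_name="permit_code", value="ENV-2O4",
    page=3, bbox=(72.0, 540.5, 180.0, 556.0), ocr_text="Permit code: ENV-2O4",
    model_version="extractor-v1", confidence=0.81, flags=["ocr_ambiguous_char"],
)
record.correct("reviewer-17", "ENV-204")
serialised = json.dumps(asdict(record))
```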

Performance and cost

IDP at scale is a cost question. The factors:

  • OCR is per-page, not free. Cloud OCR APIs charge per page. At enterprise volume this is material.
  • LLM calls scale with document complexity. Long documents, many extractions, retries on validation failures all add tokens.
  • Caching is harder than in chat workloads. Each document is unique; the cache hit rate on extraction calls is low. But cached OCR output across re-runs, cached embeddings for known document types, cached few-shot prompts — these all matter.
  • Batch processing has different economics than real-time. Batch APIs from model providers offer significant discounts when latency is not critical.

Build cost monitoring at the document level. Cost per document is a clean metric that surfaces problems quickly.
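
A cost-per-document metric needs only a handful of inputs. In the sketch below every rate is a made-up placeholder, not a vendor price, and `Rates`, `document_cost`, and `batch_discount` are hypothetical names; token counts should include retries, since those are part of what the document actually cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    ocr_per_page: float        # placeholder, not a real vendor price
    llm_per_1k_input: float    # placeholder
    llm_per_1k_output: float   # placeholder


def document_cost(
    pages: int,
    input_tokens: int,
    output_tokens: int,
    rates: Rates,
    batch_discount: float = 0.0,
) -> float:
    """Cost of one document: per-page OCR plus token-priced LLM usage."""
    ocr_cost = pages * rates.ocr_per_page
    llm_cost = (
        input_tokens / 1000 * rates.llm_per_1k_input
        + output_tokens / 1000 * rates.llm_per_1k_output
    )
    # The discount applies to the LLM share only (assumed batch-API pricing).
    return round(ocr_cost + llm_cost * (1 - batch_discount), 6)


example_rates = Rates(ocr_per_page=0.01, llm_per_1k_input=0.002, llm_per_1k_output=0.006)
realtime = document_cost(10, 20_000, 1_000, example_rates)
batched = document_cost(10, 20_000, 1_000, example_rates, batch_discount=0.5)
```

Logging this number alongside the document type makes it easy to see which document classes drive spend.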

What we keep seeing

Recurring patterns in production IDP deployments:

Investment imbalance. Teams put 80% of effort on the LLM extraction step and 20% on parsing and validation. The output quality is bounded by the steps that get less attention.

Schema sprawl. A schema per use case, evolving independently. After a year there are forty schemas with overlapping fields. Consolidate.

Confidence calibration drift. A confidence threshold tuned at launch may produce too many or too few review cases six months later as document mix changes. Monitor and retune.
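
One simple way to watch for this drift is to compare the review rate in a recent window against the rate at launch. The sketch assumes routing outcomes are logged as strings with "straight_through" marking unreviewed cases; the 10-point tolerance and the function names are illustrative assumptions.

```python
def review_rate(routes: list[str]) -> float:
    """Share of documents that were sent to any kind of human review."""
    if not routes:
        return 0.0
    return sum(r != "straight_through" for r in routes) / len(routes)


def needs_retune(
    baseline_routes: list[str],
    recent_routes: list[str],
    tolerance: float = 0.10,
) -> bool:
    """Flag when the review rate has moved more than `tolerance` from baseline."""
    return abs(review_rate(recent_routes) - review_rate(baseline_routes)) > tolerance
```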

Lack of evaluation discipline. Every change to the pipeline (new parser, new model, new prompt) should be evaluated on the eval set before it ships. Teams that skip this introduce silent regressions.
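
A minimal regression check might look like the sketch below. It assumes the eval set is a list of gold records, that exact match per field is an acceptable metric, and that predictions are aligned with the gold records by position; `field_accuracy` and `regressions` are hypothetical names.

```python
def field_accuracy(predictions: list[dict], gold: list[dict]) -> dict[str, float]:
    """Exact-match accuracy per field across an eval set."""
    totals: dict[str, int] = {}
    correct: dict[str, int] = {}
    for pred, truth in zip(predictions, gold):
        for name, expected in truth.items():
            totals[name] = totals.get(name, 0) + 1
            if pred.get(name) == expected:
                correct[name] = correct.get(name, 0) + 1
    return {name: correct.get(name, 0) / totals[name] for name in totals}


def regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> list[str]:
    """Fields where the candidate pipeline scores worse than the baseline."""
    return sorted(
        name for name in baseline
        if candidate.get(name, 0.0) < baseline[name] - tolerance
    )
```

Running this per field, not as a single aggregate score, is what exposes the case where a new prompt improves one field while quietly degrading another.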

What we recommend

For an enterprise IDP initiative in 2024:

  1. Invest in the parsing layer first. OCR and layout quality are the foundation; the LLM cannot recover what the parser destroys.
  2. Define the schema explicitly. Vague schemas produce vague extractions.
  3. Validate aggressively against business rules. Confidence is not correctness.
  4. Design the review experience as part of the product, not as a fallback.
  5. Build the audit trail from day one.
  6. Monitor cost per document. It surfaces issues fast.
  7. Run evaluation on every change. Without it, regressions are silent.

Intelligent document processing is now an LLM-augmented discipline, but the discipline matters more than the LLM. The teams that ship are the ones that respect the engineering work around the model.
