Tags: azure-ai-foundry · agentic-ai · human-in-the-loop · cosmos-db · vector-search · azure-functions

Building a Self-Improving Document Classification Agent on Azure AI Foundry

A practical walkthrough of building an agentic AI system on Azure AI Foundry with persistent agents, human-in-the-loop approval, Cosmos DB vector memory, and a feedback loop that makes the agent smarter over time.

Joseph Aspey · 10 min read

"We get thousands of documents a week. Half of them are the same three types, but they still take up a chunk of somebody's time to classify and route. Can AI do this for us?"

It's the most common agentic AI conversation we're having with clients in 2026. And it's a perfect use case for modern agentic AI — not because LLMs are flawless, but because you can build a system where they learn from the humans correcting them, and get measurably better over time.

This post walks through how we'd build exactly that on Azure AI Foundry, using persistent agents, Azure Functions for hosting, Cosmos DB with vector search as memory, human-in-the-loop for approval, and OpenTelemetry tracing flowing into Foundry's observability stack. The result is an agent that classifies documents autonomously for the easy cases, asks for human approval when it's uncertain, and continuously improves as humans correct it.

The scenario

A client receives business documents — contracts, invoices, correspondence, compliance notices, purchase orders — through multiple channels. Each one needs to be classified, the key fields extracted, and the document routed to the right team. Historically, one or two people spent half their day doing this by hand.

The requirements for the agent:

  • Autonomous for obvious cases. When the agent is highly confident, it should classify and route without waking anyone up.
  • Human approval for ambiguity. When confidence drops below a threshold, it pauses and asks a human to confirm or correct.
  • Feedback-driven improvement. Every human correction becomes a permanent example that future classifications can learn from.
  • Full audit trail. Every decision, every tool call, every human interaction is traceable for compliance.

Let's look at how the pieces fit together.

The architecture

              ┌──────────────────┐
              │  Azure Function  │
              │   (HTTP trigger) │
              └────────┬─────────┘
                       │
                       ▼
              ┌──────────────────┐
              │  Foundry Agent   │
              │ (Persistent SDK) │
              └────────┬─────────┘
                       │
          ┌────────────┼────────────┐
          ▼            ▼            ▼
    ┌─────────┐  ┌──────────┐  ┌──────────┐
    │ Extract │  │  Search  │  │   HITL   │
    │   Tool  │  │ Feedback │  │ Approval │
    └─────────┘  └────┬─────┘  └────┬─────┘
                      ▼             ▼
                ┌──────────────────────┐
                │   Cosmos DB (Vector) │
                │ Documents + Feedback │
                └──────────────────────┘
  • Azure Functions hosts the entry point. A document arrives (queue message, HTTP call, blob trigger), and the Function kicks off the agent.
  • Foundry Agent with Persistent SDK. The agent has a persistent thread so state survives across invocations — critical for human-in-the-loop flows that might span hours.
  • Tools are C# methods the agent can invoke: extract document data, retrieve similar past decisions, request human approval, store feedback.
  • Cosmos DB for NoSQL with integrated vector search acts as the agent's long-term memory. Every document the agent sees and every human correction gets embedded and stored here.
  • Human-in-the-Loop (HITL) sends approval requests to a Teams channel or email inbox when the agent isn't confident enough to decide alone.
  • Foundry observability captures every agent run, tool call, token usage, and decision path as OpenTelemetry traces.

The persistent agent loop

When a document arrives, the flow looks like this:

  1. The Azure Function receives the document and calls the agent's API
  2. The agent creates or resumes a persistent thread — this is key, because if human approval is required, the thread can be suspended and resumed hours later without losing any context
  3. The agent calls its extract_document_data tool to pull out key fields (parties, dates, amounts, document type hints)
  4. The agent calls retrieve_similar_feedback — a vector search tool that returns the five most similar past documents and the human decisions made on them
  5. The agent reasons about the classification using the extracted data plus the retrieved examples as in-context learning
  6. If confidence is above the threshold, the agent calls store_classification and returns
  7. If confidence is below the threshold, the agent calls request_human_approval, which suspends the thread and sends an adaptive card to the approver in Teams
  8. When the human responds, the thread resumes from exactly where it left off
  9. The final decision — human-corrected or agent-made — is stored as a new feedback embedding so future runs can learn from it

Because the agent is persistent, step 7 can take five minutes or five hours. The function returns, the agent's state is preserved in Foundry's managed storage, and when the human approval arrives, a different Azure Function invocation picks up exactly where the first one stopped.
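The branch between steps 6 and 7 comes down to a confidence gate. A minimal sketch of that decision — the `ClassificationResult` shape, the `AgentAction` enum, and the 0.85 threshold are all assumptions for this example, not Foundry SDK types:

```csharp
// Illustrative sketch of the confidence gate behind steps 6-7.
// ClassificationResult, AgentAction, and the threshold value are
// assumptions for this example, not part of the Foundry SDK.
public record ClassificationResult(string DocumentId, string Label, double Confidence);

public enum AgentAction { StoreAndComplete, SuspendForApproval }

public static class ConfidenceGate
{
    // Tuned from the observed confidence distribution; start conservative
    // and tighten as the feedback library grows.
    public const double Threshold = 0.85;

    public static AgentAction Decide(ClassificationResult result) =>
        result.Confidence >= Threshold
            ? AgentAction.StoreAndComplete
            : AgentAction.SuspendForApproval;
}
```

In practice the threshold belongs in configuration, not a constant, so it can be tightened later without a redeploy.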

Tools as first-class code

In the Foundry Agent SDK, tools are just functions the agent is allowed to invoke. You define them in your host language (C#, Python, or TypeScript), register them with the agent, and the LLM decides when to call them.

A simplified tool definition for extracting document data looks something like this:

// Illustrative - actual API surface evolves
public class DocumentTools
{
    private readonly IDocumentExtractor _extractor;
    private readonly ICosmosVectorStore _vectorStore;

    public DocumentTools(IDocumentExtractor extractor, ICosmosVectorStore vectorStore)
    {
        _extractor = extractor;
        _vectorStore = vectorStore;
    }

    [AgentTool("extract_document_data",
        "Extract key fields from a document image or PDF")]
    public async Task<ExtractedData> ExtractAsync(
        string documentId,
        CancellationToken ct)
    {
        var blob = await _extractor.LoadAsync(documentId, ct);
        return await _extractor.ExtractFieldsAsync(blob, ct);
    }

    [AgentTool("retrieve_similar_feedback",
        "Find the 5 most similar past classifications and the human decisions made on them")]
    public async Task<IReadOnlyList<FeedbackExample>> RetrieveSimilarAsync(
        ExtractedData data,
        CancellationToken ct)
    {
        var embedding = await _vectorStore.EmbedAsync(data.ToText(), ct);
        return await _vectorStore.SearchFeedbackAsync(embedding, topK: 5, ct);
    }
}

The agent sees these as functions it can call mid-conversation, with strongly-typed parameters and return types. No JSON schema hand-writing, no manual argument parsing — the SDK handles the serialisation.

Human-in-the-loop: the pause/resume pattern

The HITL tool is the most interesting one. When the agent calls it, we don't want to block the Azure Function waiting for a human — that could take hours and you'd burn compute for nothing.

Instead, the tool:

  1. Stores the pending decision in Cosmos DB with a status of AwaitingApproval
  2. Sends an adaptive card to the approver in Teams with Approve/Reject/Correct buttons
  3. Returns a PendingApproval result to the agent
  4. The agent recognises this as a terminal state for this invocation and saves its thread
  5. The Function exits

Separately:

  1. When the human clicks Approve or Correct, a webhook hits a second Azure Function
  2. That function looks up the pending decision, updates Cosmos with the human response
  3. It resumes the agent's persistent thread, injecting the human feedback as the tool response
  4. The agent continues from exactly where it stopped — it sees the human's decision as the result of its HITL tool call, records it as feedback, stores the final classification, and completes

This pattern — suspend on tool call, resume on external signal — is what persistent agents enable. Without persistence, you'd need Durable Functions or some other orchestration layer to manage the state manually.
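The bookkeeping behind those two halves can be sketched with an in-memory store standing in for the Cosmos container — `PendingDecision`, `ApprovalStatus`, and `ApprovalStore` are hypothetical names for this example, and in production `Resolve` is what the webhook Function calls:

```csharp
using System;
using System.Collections.Concurrent;

// In-memory stand-in for the Cosmos DB records described above.
// All names here are hypothetical; in production the store is a
// Cosmos container and Resolve is invoked by the webhook Function.
public enum ApprovalStatus { AwaitingApproval, Approved, Corrected }

public record PendingDecision(
    string DocumentId,
    string ThreadId,          // the persistent thread to resume later
    string ProposedLabel,
    ApprovalStatus Status,
    string? HumanLabel = null);

public class ApprovalStore
{
    private readonly ConcurrentDictionary<string, PendingDecision> _pending = new();

    // First half: the HITL tool parks the decision and the Function exits.
    public void Suspend(PendingDecision decision) =>
        _pending[decision.DocumentId] = decision;

    // Second half: the webhook records the human response and hands back
    // the record (including ThreadId) so the agent can be resumed with
    // this as the result of its HITL tool call.
    public PendingDecision Resolve(string documentId, ApprovalStatus status, string? humanLabel = null)
    {
        if (!_pending.TryGetValue(documentId, out var decision))
            throw new InvalidOperationException($"No pending decision for {documentId}");
        var resolved = decision with { Status = status, HumanLabel = humanLabel };
        _pending[documentId] = resolved;
        return resolved;
    }
}
```

The important property is that `ThreadId` survives the gap: nothing about the agent's reasoning is held in the Function's memory between the two halves.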

The feedback loop that makes it self-improving

Here's where the "learning" happens, and it's important to be precise about what's actually going on: we're not fine-tuning the LLM. The underlying GPT-4-class model stays exactly as Azure provides it. What we're doing is in-context learning via retrieval — giving the agent access to a growing library of examples at inference time.

The cycle:

  1. Every classification decision produces a feedback record. If the agent was confident and acted alone, the feedback is the document + classification. If HITL was involved, the feedback is the document + the human's final decision + whether the agent's original guess was right.
  2. Each feedback record is embedded. We use an Azure OpenAI embedding model to turn the extracted document data + classification into a vector, and store it in Cosmos DB alongside the record.
  3. Cosmos DB's integrated vector index makes this searchable. Cosmos for NoSQL now supports DiskANN-based vector indexing that can handle millions of embeddings with sub-second query latency.
  4. On the next classification, the agent retrieves the N most similar past examples as part of its reasoning. These get injected into its context window as "here's how similar documents were handled previously, including cases where a human corrected the initial guess."
  5. The LLM uses these examples as in-context learning. It's not training — it's few-shot prompting at scale, automatically, every time.
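For step 3 to work, the Cosmos container needs a vector embedding policy and a vector index defined at creation time. A representative configuration fragment — the `/embedding` path is our naming choice, and the 1536 dimensions assume a small Azure OpenAI embedding model; the value must match whatever model you actually use:

```json
{
  "vectorEmbeddingPolicy": {
    "vectorEmbeddings": [
      {
        "path": "/embedding",
        "dataType": "float32",
        "dimensions": 1536,
        "distanceFunction": "cosine"
      }
    ]
  },
  "indexingPolicy": {
    "vectorIndexes": [
      { "path": "/embedding", "type": "diskANN" }
    ]
  }
}
```

Queries then rank feedback records with Cosmos DB's `VectorDistance` system function against the incoming document's embedding.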

The practical result: the first hundred documents the agent sees, it's guessing with no context. By the thousandth document, it's drawing on a thousand human-curated examples of similar classifications. Edge cases that used to need human approval start getting decided correctly without intervention, because the agent has seen them before.

Over time, the confidence threshold for HITL can be tightened — the agent becomes more discerning about when it actually needs help, because the retrieval is doing more of the work.
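The retrieved examples from step 4 end up in the prompt as plain text. A sketch of that rendering — the `FeedbackExample` shape here is a hypothetical simplification of the record returned by retrieve_similar_feedback:

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical, simplified shape of a retrieved feedback record.
public record FeedbackExample(string Summary, string FinalLabel, bool AgentWasCorrect);

public static class FewShotContext
{
    // Render retrieved examples into the in-context block the agent reasons over.
    // Corrections are called out explicitly, because "a human overruled the
    // initial guess here" is the most valuable signal in the library.
    public static string Render(IReadOnlyList<FeedbackExample> examples)
    {
        var sb = new StringBuilder("Similar past documents and how they were handled:\n");
        for (var i = 0; i < examples.Count; i++)
        {
            var ex = examples[i];
            sb.Append($"{i + 1}. {ex.Summary} -> classified as {ex.FinalLabel}");
            sb.AppendLine(ex.AgentWasCorrect
                ? ""
                : " (a human corrected the agent's initial guess)");
        }
        return sb.ToString();
    }
}
```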

Observability: tracing and telemetry in Foundry

This is the part that separates "we built an agent" from "we built an agent we can operate." Azure AI Foundry has first-class tracing for agents:

  • Every agent run appears in the Foundry traces view
  • Every tool call is a span with input, output, latency, and token cost
  • Every LLM call shows the system prompt, user messages, tool responses, and the model's reasoning
  • Distributed tracing lets you follow a single document from the HTTP trigger on the Function, through the agent's tool calls, into Cosmos DB, out to Teams for approval, and back

Without this, debugging an agent is genuinely miserable. The LLM makes a decision based on context you can't see, based on retrieved examples you don't know about, based on a system prompt that might have been modified in a config file somewhere. You need to be able to replay the full decision path. Foundry gives you that for free, and it integrates with OpenTelemetry so the same traces flow into Application Insights or any OTel-compatible backend if you want unified observability across your broader system.

We treat the traces view as a debugging UI during development and as a compliance artifact in production. Every classification decision can be traced end-to-end, including the exact feedback examples the agent retrieved and used.
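Custom tool spans slot into that pipeline via `System.Diagnostics.ActivitySource`, which is what the .NET OpenTelemetry SDK listens to. A minimal sketch — the source name and tag keys are our own conventions, not Foundry requirements:

```csharp
using System;
using System.Diagnostics;

// Wraps a tool call in an Activity (an OTel span in .NET terms). With an
// OTel exporter registered against this source, each call shows up as a
// child span of the agent run. Source name and tag keys are illustrative.
public static class AgentTracing
{
    public static readonly ActivitySource Source = new("DocClassifier.Agent");

    public static T TraceToolCall<T>(string toolName, Func<T> tool)
    {
        // StartActivity returns null when no listener is registered,
        // hence the null-conditional calls below.
        using var activity = Source.StartActivity($"tool.{toolName}");
        activity?.SetTag("agent.tool.name", toolName);
        var result = tool();
        activity?.SetTag("agent.tool.succeeded", true);
        return result;
    }
}
```

Because this is plain `ActivitySource`, the same spans flow to Application Insights, Foundry, or any other OTel backend depending solely on which exporter you configure at startup.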

Where to start

If you're building your first agentic system on Foundry, resist the urge to build everything at once. Start narrower than feels comfortable:

  1. Pick one document type. Don't try to classify everything on day one.
  2. Build one tool. Just the extraction, no retrieval, no HITL. Make sure the agent can see a document and return structured data.
  3. Add Cosmos DB storage. No vector search yet — just logging every decision.
  4. Add the retrieval tool. Start feeding past decisions back in. Measure the improvement.
  5. Add HITL last. Only once you understand the confidence distribution enough to set a useful threshold.

Each step is independently valuable, and you get to see the agent improve meaningfully at each stage. Trying to build the full HITL + retrieval + feedback loop from scratch usually ends in a half-working prototype that never ships.

Closing thought

The agentic AI space is moving fast, and Azure AI Foundry is genuinely one of the best platforms to build on right now — particularly if you're already in the Microsoft ecosystem, using Azure OpenAI, or running .NET workloads. The combination of persistent agents, tools as first-class code, Cosmos DB vector search as memory, and built-in tracing gives you a production-grade foundation that would have taken months to assemble two years ago.

The pattern in this post — persistent agent, tools, HITL, feedback-augmented retrieval — is portable across many use cases beyond document classification. Invoice approval, contract review, compliance triage, customer enquiry routing, ticket escalation, lead qualification. Anywhere you have a repetitive decision with human judgement at the edges, it fits.


Want help building an agent like this?

If you're exploring agentic AI, evaluating Azure AI Foundry for a real project, or want to turn a proof-of-concept into a production system — get in touch. We build agents like this on Foundry, Semantic Kernel, and Azure Functions, with human-in-the-loop safeguards and production-grade observability from day one. More details on our AI Agents & Automation service.

