Inference

Inference is the process of running a trained AI model on new inputs to generate outputs. Every time you send a document to an AI for processing, you are performing inference. It is the operational phase of AI — distinct from training, which is where models learn.

What is Inference?

Building an AI model involves two distinct phases. Training is where the model learns: it processes millions of examples, adjusts its internal parameters, and builds up its capabilities. Inference is where the model performs: it takes a new input it has never seen before and applies what it learned to produce an output.

For most operational teams, inference is the only phase they interact with. When you call an AI API to extract fields from an invoice, classify an exception, or generate a supplier response, the model is performing inference. Training happened earlier, typically on the model provider's side, and the resulting weights are what you use at runtime.
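
To make this concrete, here is a minimal sketch of such a call from the operational side. The endpoint URL, authentication scheme, field names, and response shape are all hypothetical placeholders rather than any specific provider's API; substitute your provider's actual interface.

```python
import requests  # widely used third-party HTTP client

# Hypothetical endpoint and key, for illustration only.
API_URL = "https://api.example.com/v1/extract"
API_KEY = "YOUR_API_KEY"

def extract_invoice_fields(document_text: str) -> dict:
    """Run inference on one document via a hosted model.

    The model's weights are fixed at this point; the call only runs
    a forward pass on new input. No learning happens here.
    """
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "document": document_text,
            "fields": ["invoice_number", "supplier", "total", "due_date"],
        },
        timeout=30,
    )
    response.raise_for_status()
    # Assumed response shape, e.g. {"invoice_number": "INV-1042", ...}
    return response.json()
```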

Inference Performance: What Matters

Two metrics define inference performance in operational contexts:

  • Latency: How long does it take to get a response? For interactive workflows — a user waiting for a result — latency below two seconds is generally acceptable. For fully automated batch processing, latency matters less than throughput.

  • Throughput: How many requests can the system process per unit of time? A model that takes five seconds per document but processes 50 simultaneously may have higher effective throughput than a faster model running sequentially, as the sketch after this list shows.
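
Here is the back-of-the-envelope arithmetic behind that comparison. The five seconds and 50 concurrent requests come from the example above; the one-second latency for the faster sequential model is an assumed figure for illustration.

```python
def effective_throughput(latency_s: float, concurrency: int) -> float:
    """Documents per second when `concurrency` requests run in parallel."""
    return concurrency / latency_s

# The slower model handling 50 documents at once:
parallel = effective_throughput(latency_s=5.0, concurrency=50)    # 10.0 docs/s

# A faster model (assumed 1 s/doc) processing one document at a time:
sequential = effective_throughput(latency_s=1.0, concurrency=1)   # 1.0 docs/s
```

Despite being five times slower per document, the parallel setup clears ten times as many documents per second.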

Inference cost is also a real operational variable. Cloud AI APIs charge per token processed. At high document volumes — thousands of invoices per day — inference costs need to be modelled into the business case.
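
A rough cost model takes only a few lines. Every figure below is an assumption for illustration; per-token rates vary widely by provider and model, so substitute your own prices and your own measured token counts per document.

```python
# Assumed per-token prices in USD; check your provider's actual rates.
PRICE_PER_1K_INPUT_TOKENS = 0.003
PRICE_PER_1K_OUTPUT_TOKENS = 0.015

def daily_inference_cost(docs_per_day: int,
                         avg_input_tokens: int,
                         avg_output_tokens: int) -> float:
    """Estimate daily spend on a per-token-priced cloud API."""
    cost_per_doc = (avg_input_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
                    + avg_output_tokens / 1000 * PRICE_PER_1K_OUTPUT_TOKENS)
    return docs_per_day * cost_per_doc

# e.g. 5,000 invoices/day at ~1,500 tokens in and ~200 tokens out:
print(f"${daily_inference_cost(5000, 1500, 200):,.2f} per day")  # $37.50
```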

Inference in Operations

When planning an operational AI deployment, inference is where the architecture decisions bite. How many documents per hour need to be processed? What is the acceptable latency before a human reviewer gets the AI's output? Are you calling a cloud API or running a self-hosted model? These are not abstract technical questions — they determine whether the system keeps up with your operations or becomes a bottleneck. Get the inference architecture right before optimising anything else.
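
A simple capacity check, sketched below with assumed figures, is often enough to show whether a proposed setup keeps up with the required document rate before any deeper optimisation.

```python
def can_keep_up(docs_per_hour: int, latency_s: float, concurrency: int) -> bool:
    """Does an inference setup sustain the required document rate?"""
    required_rate = docs_per_hour / 3600       # docs/s you must sustain
    achievable_rate = concurrency / latency_s  # docs/s the setup delivers
    return achievable_rate >= required_rate

# Assumed figures: 2,000 docs/hour, 3 s latency, 5 parallel requests.
print(can_keep_up(docs_per_hour=2000, latency_s=3.0, concurrency=5))
# True: ~1.67 docs/s achievable vs ~0.56 docs/s required
```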

Turn your manual decisions into intelligent operations

See how we capture your decision intelligence and put it to work inside the systems you already have. Start with one workflow. See results in days.
