Latency

Latency is the time it takes for an AI system to return a response after receiving an input. In operational AI, it determines whether automated workflows feel instantaneous or whether they become a bottleneck that slows down the people and processes they are supposed to accelerate.

What is Latency?

In computing, latency refers to the delay between a request and a response. For AI systems, latency is the time from submitting an input — a document, a query, an extraction request — to receiving the model's output. It is measured in milliseconds for fast systems and seconds for slower ones.
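As a minimal sketch of how this delay can be measured, the snippet below times a single call with a wall-clock timer. The names `timed_call` and `fake_model_call` are hypothetical, and the 50 ms sleep is only a stand-in for a real model request.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed_call(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn once and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def fake_model_call() -> str:
    # Hypothetical stand-in for a real model request: sleeps about 50 ms.
    time.sleep(0.05)
    return "extracted fields"


output, latency_ms = timed_call(fake_model_call)
print(f"{output!r} returned in {latency_ms:.0f} ms")
```

In practice you would time many requests and look at the spread rather than a single value, since individual measurements vary.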

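Latency is not the same as throughput, which is the number of requests a system completes per unit of time. A system can have high latency (each request takes five seconds) and still have high throughput: if it processes 100 requests concurrently, it completes about 20 requests per second. Depending on your workflow design, one may matter more than the other.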
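The relationship between the two can be sketched with simple arithmetic. This assumes a steady state in which every concurrent slot is always busy, which real systems only approximate; the function name `throughput_per_second` is illustrative.

```python
def throughput_per_second(concurrent_requests: int, latency_seconds: float) -> float:
    """Requests completed per second when all concurrent slots stay busy."""
    if latency_seconds <= 0:
        raise ValueError("latency_seconds must be positive")
    return concurrent_requests / latency_seconds


# Slow per request, but highly parallel: 100 requests in flight, 5 s each.
parallel = throughput_per_second(concurrent_requests=100, latency_seconds=5.0)

# Fast per request, but strictly one at a time: 1 request in flight, 0.25 s each.
serial = throughput_per_second(concurrent_requests=1, latency_seconds=0.25)

print(parallel, serial)  # 20.0 4.0
```

The slower system completes five times as many requests per second here, yet each individual user still waits five seconds for an answer.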

What Drives Latency in AI Systems

Several factors determine how fast an AI system responds:

  • Model size: Larger models with more parameters take longer to run inference. A 7B-parameter model typically responds faster than a 70B-parameter model on equivalent hardware.

  • Input and output length: Processing a 50-page contract takes longer than processing a 2-page invoice. Output is generated token by token, so generation time grows roughly in proportion to output length.

  • Infrastructure: GPU type, network proximity to the model server, and batch size all affect response time.

  • API load: Cloud AI APIs introduce variable latency based on current demand. During peak hours, response times can increase significantly.
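The factors above can be combined into a rough back-of-the-envelope estimate. The sketch below is a simplified model, not a description of any particular system: it assumes latency is a fixed overhead (network, queueing under load) plus the time to read the input plus the time to generate the output. The function name, parameters, and all rates are hypothetical and purely illustrative.

```python
def estimate_latency_seconds(
    input_tokens: int,
    output_tokens: int,
    prefill_tokens_per_s: float,
    decode_tokens_per_s: float,
    overhead_s: float = 0.0,
) -> float:
    """Rough estimate: fixed overhead + input processing + output generation."""
    if prefill_tokens_per_s <= 0 or decode_tokens_per_s <= 0:
        raise ValueError("token rates must be positive")
    prefill_s = input_tokens / prefill_tokens_per_s   # time to read the input
    decode_s = output_tokens / decode_tokens_per_s    # time to generate the output
    return overhead_s + prefill_s + decode_s


# Illustrative numbers only, not measurements of any real model or API.
short_doc = estimate_latency_seconds(1_000, 200, 4_000.0, 50.0, overhead_s=0.1)
long_doc = estimate_latency_seconds(25_000, 200, 4_000.0, 50.0, overhead_s=0.1)
print(f"short document: {short_doc:.2f} s, long document: {long_doc:.2f} s")
```

Model size and hardware show up in this model as the two token rates, while API load mostly affects the overhead term. Measured values from your own system should replace every number here.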

Latency in Operations

For operational workflows, latency matters differently depending on the use case. A fully automated batch process running overnight can tolerate 10-second responses per document. An agent embedded in an ERP interface that a controller uses in real time cannot: responses over two to three seconds break the work rhythm.

Design your AI workflows with latency requirements in mind from the start. If real-time interaction is required, choose smaller, faster models and optimise prompts for brevity. If batch processing is acceptable, you can use larger, more capable models without the latency constraint.
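One way to make this design choice explicit is to treat the latency requirement as a budget and pick the most capable model that fits it. The sketch below assumes you have already measured a typical latency for each candidate; the model names, capability ranks, and latency figures are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ModelProfile:
    name: str
    capability_rank: int      # higher means more capable (illustrative ranking)
    typical_latency_s: float  # illustrative; measure on your own workload


def pick_model(models: List[ModelProfile], budget_s: float) -> Optional[ModelProfile]:
    """Return the most capable model within the latency budget, or None."""
    candidates = [m for m in models if m.typical_latency_s <= budget_s]
    if not candidates:
        return None
    return max(candidates, key=lambda m: m.capability_rank)


# Hypothetical candidates, not real products or benchmarks.
MODELS = [
    ModelProfile("small-model", capability_rank=1, typical_latency_s=0.8),
    ModelProfile("large-model", capability_rank=2, typical_latency_s=8.0),
]

interactive = pick_model(MODELS, budget_s=2.0)   # real-time use inside an ERP screen
batch = pick_model(MODELS, budget_s=10.0)        # overnight batch processing
print(interactive.name, batch.name)  # small-model large-model
```

Returning `None` when nothing fits the budget forces the caller to decide whether to relax the requirement or redesign the workflow, rather than silently falling back to a model that is too slow.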

Turn your manual decisions into intelligent operations

See how we capture your decision intelligence and put it to work inside the systems you already have. Start with one workflow. See results in days.
