LLM Latency Is Output-Size Bound

As it stands today, LLM applications have noticeable latency, but much of that latency is output-size bound rather than input-size bound. That means the amount of text that goes into a prompt matters far less than the amount of text the model has to generate.

Because of the way transformers generate text, an LLM predicts the response one token at a time, with each token depending on the ones before it, whereas the prompt can be processed in a single parallel pass. So it makes sense that a longer response will take longer, largely regardless of the input size. For example, a short prompt like “write me a 5 paragraph essay on computers” will have higher latency than “write me one sentence about computers”, even though the two prompts are nearly the same length.
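A rough way to see this is a back-of-the-envelope latency model. The sketch below is only illustrative: the function name and the two throughput figures (prompt tokens processed per second, output tokens generated per second) are assumptions chosen for the example, not measurements from any particular model or provider.

```python
def estimate_latency_s(
    input_tokens: int,
    output_tokens: int,
    prefill_tokens_per_s: float = 5000.0,  # assumed: prompt is processed in parallel, so this is fast
    decode_tokens_per_s: float = 50.0,     # assumed: output is generated one token at a time, so this is slow
) -> float:
    """Estimate end-to-end latency as prompt processing time plus generation time."""
    prefill_s = input_tokens / prefill_tokens_per_s
    decode_s = output_tokens / decode_tokens_per_s
    return prefill_s + decode_s


# Long prompt, one-sentence answer.
long_prompt_short_answer = estimate_latency_s(input_tokens=1000, output_tokens=20)

# Short prompt, five-paragraph essay.
short_prompt_long_answer = estimate_latency_s(input_tokens=10, output_tokens=500)

print(f"long prompt, short answer: {long_prompt_short_answer:.2f}s")
print(f"short prompt, long answer: {short_prompt_long_answer:.2f}s")
```

Under these assumed rates, the short prompt with the long answer is much slower than the long prompt with the short answer, because the per-token generation cost dominates the total.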

However, when using techniques like RAG, the input might be used to fetch related documents, which can introduce several more operations (including additional LLM prompts) that add to the overall latency.
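To make the RAG caveat concrete, here is a minimal sketch that treats a request as a sequence of stages whose latencies add up. The stage names and timings are hypothetical placeholders, and it assumes the stages run one after another rather than in parallel.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    latency_s: float  # hypothetical timing, for illustration only


def pipeline_latency_s(stages: list[Stage]) -> float:
    """Total latency when stages run sequentially."""
    return sum(stage.latency_s for stage in stages)


# A plain prompt: the only cost is generating the answer.
plain = [Stage("generate answer", 4.0)]

# A RAG request: extra steps run before the final generation.
rag = [
    Stage("rewrite query (extra LLM call)", 1.0),
    Stage("fetch related documents", 0.2),
    Stage("generate answer", 4.0),
]

print(f"plain: {pipeline_latency_s(plain):.1f}s")
print(f"rag:   {pipeline_latency_s(rag):.1f}s")
```

In this sketch the final generation step costs the same in both cases; the added latency comes entirely from the operations the input triggers before it.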

Read What We’ve Learned From A Year of Building with LLMs.

See also: