Understanding Latency Metrics for LLM API Calls from an Integration Perspective
Recently I have been working with LLM API calls a lot and realized that latency is a crucial part of any LLM integration. For conversational applications built on top of LLM APIs, such as customer support chatbots or voice bots, latency matters even more: users won't have a good experience if the application isn't responsive enough.
Latency can make or break the user experience, so measuring it is just as important. As the saying goes, what we can't measure, we can't improve.
Interestingly, latency metrics for LLMs are not like those of regular HTTP API calls: responses are typically streamed chunk by chunk, so a single request has several distinct timing dimensions.
Key Latency Metrics
Here are the different latency metrics that I have implemented:
1. Time to First Token (TTFT) / TPIT (Time Per Initial Token) / TTFC (Time to First Chunk)
TTFT measures the time taken from when a request is sent to the API to when the first token is received.
It captures the initial processing time, including model loading and early inference stages.
Many models don't return a single token per stream event but a chunk of several tokens, so from a developer's perspective the time to the first token and the time to the first chunk are interchangeable.
Why it matters
TTFT is crucial for interactive applications, such as chatbots or real-time text generation. A lower TTFT ensures that users don’t experience unnecessary delays before seeing the model’s response, making the application feel more responsive.
Even if the whole response takes longer to finish, it feels instant because the user has something to see on screen right away.
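Measuring TTFT boils down to timestamping the request and the arrival of the first streamed chunk. Here is a minimal sketch; `fake_stream` is a hypothetical stand-in for a provider SDK's streaming response, not a real client:

```python
import time

def fake_stream():
    """Stand-in for a streaming LLM response (hypothetical; a real
    client would yield chunks from the provider's SDK)."""
    time.sleep(0.05)  # simulated delay before the first chunk arrives
    yield "Hello"
    for _ in range(3):
        time.sleep(0.01)
        yield " world"

def measure_ttft(stream):
    """Return (ttft_seconds, chunks) for a chunk iterator."""
    start = time.monotonic()
    ttft = None
    chunks = []
    for chunk in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first chunk observed
        chunks.append(chunk)
    return ttft, chunks

ttft, chunks = measure_ttft(fake_stream())
print(f"TTFT: {ttft * 1000:.1f} ms over {len(chunks)} chunks")
```

Note the use of `time.monotonic()` rather than `time.time()`, so the measurement is immune to wall-clock adjustments.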
2. Inter-token/Inter-chunk Latency (ITL)
As the name suggests, this is the latency between different chunks in a single response.
This helps derive per-response timing statistics, such as the average latency between chunks or the p95. Tracking it is also important because it gives visibility into any abnormal latency behavior within a response, for example a single long stall in the middle of an otherwise fast stream.
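Given the arrival timestamps of the chunks, the inter-chunk gaps and their statistics are straightforward to compute. A small sketch, using hard-coded timestamps for illustration (in practice you would record `time.monotonic()` as each chunk arrives):

```python
import statistics

def inter_chunk_latencies(arrival_times):
    """Gaps between consecutive chunk arrival timestamps, in seconds."""
    return [later - earlier
            for earlier, later in zip(arrival_times, arrival_times[1:])]

# Simulated arrival timestamps; the third gap is abnormally long.
times = [0.00, 0.02, 0.04, 0.30, 0.32]
gaps = inter_chunk_latencies(times)

avg = statistics.mean(gaps)
p95 = statistics.quantiles(gaps, n=100, method="inclusive")[94]
print(f"avg: {avg:.3f}s  p95: {p95:.3f}s  max: {max(gaps):.3f}s")
```

With only a handful of chunks, percentiles are noisy; they become meaningful when aggregated over many responses.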
3. End-to-End Request Latency
e2e_latency refers to the total time taken from sending a request to receiving the complete response. Depending on how much pre- and post-processing is done, it's up to the developer to include those steps here or track them separately.
Keeping them separate gives clearer visibility into the actual latency of the LLM call itself.
4. Pre-processing and Post-processing Time
Quite often, responses are stored in a separate system, either directly or through some library, so that errors and other issues can be identified later. This processing adds its own latency, which is why we need to track it separately from the LLM call.
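One way to keep these timings separate is to time each phase individually and report them side by side. A minimal sketch, where `preprocess`, `call_llm`, and `postprocess` are hypothetical stand-ins (with sleeps simulating real work):

```python
import time

def preprocess(prompt):    # stand-in: e.g. prompt template rendering
    time.sleep(0.01)
    return prompt.strip()

def call_llm(prompt):      # stand-in for the actual provider call
    time.sleep(0.05)
    return f"echo: {prompt}"

def postprocess(response): # stand-in: e.g. storing the response
    time.sleep(0.01)
    return response.upper()

def timed(fn, *args):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = fn(*args)
    return result, time.monotonic() - start

prompt, pre_t = timed(preprocess, "  What is TTFT?  ")
response, llm_t = timed(call_llm, prompt)
final, post_t = timed(postprocess, response)

metrics = {
    "pre_processing_s": pre_t,
    "llm_e2e_s": llm_t,   # the LLM call tracked on its own
    "post_processing_s": post_t,
    "total_s": pre_t + llm_t + post_t,
}
print(metrics)
```

Reporting the phases separately makes it immediately obvious whether a slow request was slow because of the LLM provider or because of our own code around it.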
Wrapping up
Apart from these metrics, we can also track requests per second, so that we know at the system level how many requests are being sent to the LLM provider and can avoid hitting rate limits.
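A simple way to track this is a sliding-window counter over request timestamps. A minimal sketch (the class name and interface are my own, not from any particular library):

```python
import time
from collections import deque

class RequestRateTracker:
    """Count requests seen within the last `window` seconds."""

    def __init__(self, window=1.0):
        self.window = window
        self.timestamps = deque()

    def record(self, now=None):
        """Register one outgoing request."""
        now = time.monotonic() if now is None else now
        self.timestamps.append(now)
        self._evict(now)

    def rate(self, now=None):
        """Requests within the current window."""
        now = time.monotonic() if now is None else now
        self._evict(now)
        return len(self.timestamps)

    def _evict(self, now):
        # Drop timestamps that have aged out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()

tracker = RequestRateTracker(window=1.0)
for t in (0.0, 0.2, 0.4, 1.5):  # simulated request times, in seconds
    tracker.record(now=t)
print(tracker.rate(now=1.5))    # only requests in the last second count
```

Comparing this running rate against the provider's documented limit lets you throttle or queue requests before the provider starts rejecting them.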
There are tools like Graphsignal which act as a wrapper over LLM calls and capture some of these metrics for monitoring purposes. I am going to deep-dive into such tools in another blog post!
References
- LLM inference latency metrics
- https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html
- https://dagshub.com/blog/llm-evaluation-metrics/
- https://sambanova.ai/blog/tokens-per-second-is-not-all-you-need
- https://graphsignal.com/blog/open-ai-api-cost-tracking-analyzing-expenses-by-model-deployment-and-context/
- https://blog.spheron.network/best-practices-for-llm-inference-performance-monitoring
- https://graphsignal.com/docs/guides/quick-start/