coderquill's inklings

Understanding Latency Metrics for LLM API Calls from an Integration Perspective

Recently I have been working a lot with LLM API calls and realized that latency is a crucial part of any LLM integration. For conversational applications built on top of LLM APIs, such as customer support, chatbots, and voice bots, latency becomes even more critical, because users won't have a good experience if the application isn't responsive enough.

Latency can make or break the user experience, so measuring it matters just as much. As they say: what we can't measure, we can't improve.

Interestingly, latency metrics for LLMs are not like those of our regular HTTP API calls.

Key Latency Metrics

Here are the different latency metrics that I have implemented:

1. Time to First Token (TTFT) / Time Per Initial Token (TPIT) / Time to First Chunk (TTFC)

TTFT measures the time taken from when a request is sent to the API to when the first token is received.

It captures the initial processing time, including model loading and early inference stages.

Many models don't return individual tokens in a streaming response but chunks of several tokens, so from a developer's perspective, time to first token and time to first chunk are interchangeable.

Why it matters

TTFT is crucial for interactive applications, such as chatbots or real-time text generation. A lower TTFT ensures that users don’t experience unnecessary delays before seeing the model’s response, making the application feel more responsive.

Even if the full response takes longer to finish, it feels instant because the user has something to see on screen right away.
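TTFT can be measured by timing how long it takes to pull the first chunk from a streaming response. A minimal sketch, where `fake_stream` is a hypothetical stand-in for a real streaming LLM response:

```python
import time

def measure_ttft(stream):
    """Return (first_chunk, ttft_seconds) for a chunk iterator."""
    start = time.perf_counter()
    first = next(stream)  # blocks until the first chunk arrives
    ttft = time.perf_counter() - start
    return first, ttft

def fake_stream():
    # Hypothetical stand-in for a streaming LLM response.
    time.sleep(0.05)  # simulated delay before the first chunk
    yield "Hello"
    yield " world"

chunk, ttft = measure_ttft(fake_stream())
print(f"first chunk={chunk!r}, TTFT={ttft * 1000:.1f} ms")
```

The same pattern works with any provider SDK that exposes the response as an iterator of chunks; you just wrap the real stream instead of `fake_stream`.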

2. Inter-token/Inter-chunk Latency (ITL)

As the name suggests, this is the latency between consecutive chunks within a single response.

This helps derive timing metrics for a single response, such as the average latency between tokens and the p95. Tracking this is also important because it gives visibility into any abnormal latency behavior among the chunks of a response.

3. End-to-End Request Latency

End-to-end latency (e2e_latency) refers to the total time from sending a request to receiving the complete response. Depending on what pre- and post-processing is done, it's up to the developer whether to include those steps here or keep them separate.

Keeping them separate gives clearer visibility into the actual latency of the LLM calls.
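One way to keep the stages separate is to timestamp the boundaries between them and report each duration on its own. A sketch, where `run_llm_call` is a hypothetical stand-in for the actual (non-streaming) LLM call:

```python
import time

def run_llm_call(prompt):
    # Hypothetical stand-in for a full LLM API call.
    time.sleep(0.02)
    return prompt.upper()

prompt = " what is ttft? "

t0 = time.perf_counter()
cleaned = prompt.strip()              # pre-processing, timed separately
t1 = time.perf_counter()
response = run_llm_call(cleaned)      # the LLM call itself
t2 = time.perf_counter()
record = {"prompt": cleaned, "response": response}  # post-processing
t3 = time.perf_counter()

metrics = {
    "pre_processing": t1 - t0,
    "e2e_latency": t2 - t1,   # LLM-only end-to-end latency
    "post_processing": t3 - t2,
}
print(metrics)
```

With the boundaries recorded like this, the total request time is simply the sum of the three durations, so nothing is lost by keeping them apart.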

4. Pre-processing and Post-processing Time

Often, the responses are stored in a separate system, either directly or via some library, so that errors and anomalies can be identified later.

This means we need to track that time separately from the LLM call itself.
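A small context manager makes it easy to time any named stage without cluttering the main flow. A sketch, where the stored record and the `timings` dict are illustrative assumptions:

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(name):
    """Record the wall-clock duration of a named stage into `timings`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

with timed("post_processing"):
    # Hypothetical post-processing: persist the response for later error analysis.
    record = {"response": "hello", "ok": True}
    time.sleep(0.01)  # simulated write to a separate store

print(timings)
```

The same `timed("pre_processing")` wrapper can be placed around input cleanup, so every stage ends up in the same metrics dict.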

Wrapping up

Apart from these metrics, we can also track requests per second, so that we know, at the system level, how many requests are being sent to the LLM provider and can avoid hitting rate limits.
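Requests per second can be tracked with a sliding one-second window of timestamps. A minimal sketch; `RequestRateTracker` is a hypothetical helper, and the explicit `now` arguments exist only to make the example deterministic:

```python
import time
from collections import deque

class RequestRateTracker:
    """Count requests within a sliding time window (hypothetical helper)."""

    def __init__(self, window=1.0):
        self.window = window
        self.timestamps = deque()

    def record(self, now=None):
        """Register one outgoing request."""
        now = time.monotonic() if now is None else now
        self.timestamps.append(now)
        self._evict(now)

    def rate(self, now=None):
        """Number of requests in the last `window` seconds."""
        now = time.monotonic() if now is None else now
        self._evict(now)
        return len(self.timestamps)

    def _evict(self, now):
        # Drop timestamps that have fallen out of the window.
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()

tracker = RequestRateTracker()
for t in (0.0, 0.2, 0.4, 1.1):
    tracker.record(now=t)
print(tracker.rate(now=1.1))  # prints 3: only 0.2, 0.4 and 1.1 are within the last second
```

Checking `tracker.rate()` before dispatching a request gives a simple client-side guard against the provider's rate limits.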

There are tools like Graphsignal which act as a wrapper over LLM calls and capture some of these metrics for monitoring purposes. I am going to do a deep dive into such tools in another blog post!

If you found this helpful, please share it to help others find it! Feel free to connect with me on any of these platforms => Email | LinkedIn | Resume | Github | Twitter | Instagram 💜