Benchmark KV Cache Offloading with Multi-Turn Conversations

The pip requirements for benchmark_serving_multi_turn.py can be found in requirements.txt.

First, start serving your model:

export MODEL_PATH=/models/meta-llama/Meta-Llama-3.1-8B-Instruct/

vllm serve $MODEL_PATH --served-model-name Llama --disable-log-requests

The MODEL_PATH variable should point to the model files (e.g. downloaded from Hugging Face).
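
If you do not already have the model locally, one way to download it (assuming the huggingface_hub CLI is installed and you have access to the model) is:

pip install huggingface_hub
huggingface-cli download meta-llama/Meta-Llama-3.1-8B-Instruct --local-dir $MODEL_PATH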

Synthetic Multi-Turn Conversations

Download the following text file (it is used to generate the synthetic conversations):

wget https://www.gutenberg.org/ebooks/1184.txt.utf-8
mv 1184.txt.utf-8 pg1184.txt

The filename pg1184.txt is used in generate_multi_turn.json (see "text_files").

You may use other text files if you prefer; this specific file is not required.

Then run the benchmarking script:

export MODEL_PATH=/models/meta-llama/Meta-Llama-3.1-8B-Instruct/

python benchmark_serving_multi_turn.py --model $MODEL_PATH --served-model-name Llama \
--input-file generate_multi_turn.json --num-clients 2 --max-active-conversations 6

You can edit the file generate_multi_turn.json to change the conversation parameters (number of turns, etc.).

If successful, you will see output similar to the following (ttft_ms is time to first token, tpot_ms is time per output token):

----------------------------------------------------------------------------------------------------
Statistics summary:
runtime_sec = 215.810
requests_per_sec = 0.769
----------------------------------------------------------------------------------------------------
                   count     mean     std      min      25%      50%      75%      90%      99%      max
ttft_ms            166.0    78.22   67.63    45.91    59.94    62.26    64.43    69.66   353.18   567.54
tpot_ms            166.0    25.37    0.57    24.40    25.07    25.31    25.50    25.84    27.50    28.05
latency_ms         166.0  2591.07  326.90  1998.53  2341.62  2573.01  2860.10  3003.50  3268.46  3862.94
input_num_turns    166.0     7.43    4.57     1.00     3.00     7.00    11.00    13.00    17.00    17.00
input_num_tokens   166.0  2006.20  893.56   522.00  1247.75  2019.00  2718.00  3233.00  3736.45  3899.00
output_num_tokens  166.0   100.01   11.80    80.00    91.00    99.00   109.75   116.00   120.00   120.00
output_num_chunks  166.0    99.01   11.80    79.00    90.00    98.00   108.75   115.00   119.00   119.00
----------------------------------------------------------------------------------------------------

If you run with --warmup-step, the summary will also include warmup_runtime_sec and total_runtime_incl_warmup_sec (while runtime_sec continues to reflect the benchmark-only runtime so the reported throughput stays comparable).
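
For example, appending the flag to the earlier command (assuming --warmup-step takes no argument):

python benchmark_serving_multi_turn.py --model $MODEL_PATH --served-model-name Llama \
--input-file generate_multi_turn.json --num-clients 2 --max-active-conversations 6 \
--warmup-step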

JSON configuration file for synthetic conversation generation

The --input-file flag determines the input conversations for the benchmark.
When the input is a JSON file with the field "filetype": "generate_conversations", the tool will generate synthetic multi-turn conversations (questions and answers).

The file generate_multi_turn.json is an example file.

The file must contain the sections prompt_input and prompt_output.

The prompt_input section must contain num_turns, prefix_num_tokens and num_tokens:

  • num_turns - Number of total turns in the conversation (both user & assistant).
    The final value will always be rounded to an even number so each user turn has a reply.
  • prefix_num_tokens - Tokens added at the start of only the first user turn in a conversation (unique per conversation).
  • num_tokens - Total token length of each user message (one turn).

The prompt_output section must contain num_tokens:

  • num_tokens - Total token length of each assistant message (one turn).
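
A minimal example configuration combining these sections is sketched below. The field values are illustrative only, and the exact set of top-level fields may differ from the real generate_multi_turn.json (check the actual file):

{
    "filetype": "generate_conversations",
    "text_files": ["pg1184.txt"],
    "prompt_input": {
        "num_turns": {
            "distribution": "uniform",
            "min": 2,
            "max": 12
        },
        "prefix_num_tokens": {
            "distribution": "lognormal",
            "average": 1000,
            "max": 5000
        },
        "num_tokens": {
            "distribution": "uniform",
            "min": 100,
            "max": 200
        }
    },
    "prompt_output": {
        "num_tokens": {
            "distribution": "uniform",
            "min": 80,
            "max": 120
        }
    }
}

The distribution objects are described in the next section.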

Random distributions for synthetic conversation generation

When creating an input JSON file (such as generate_multi_turn.json),
every numeric field (such as num_turns or num_tokens) requires a distribution.
The distribution determines how to randomly sample values for the field.

The available distributions are listed below.

Note: the optional max field (available for lognormal, zipf, and poisson) caps sampled values at an upper bound.
It can be used, for example, to make sure that the total number of tokens in every request does not exceed --max-model-len.

constant

{
    "distribution": "constant",
    "value": 500
}
  • value - the fixed integer value (always returns the same number).

uniform

{
    "distribution": "uniform",
    "min": 12,
    "max": 18
}
  • min - minimum value (inclusive).
  • max - maximum value (inclusive); must be greater than or equal to min.

lognormal

{
    "distribution": "lognormal",
    "average": 1000,
    "max": 5000
}

You can parameterize the lognormal distribution in one of two ways:

Using the average and the optional median_ratio:

  • average - target average value of the distribution.
  • median_ratio - the ratio of the median to the average; controls the skewness. Must be in the range (0, 1).

Using the parameters of the underlying normal distribution:

  • mean - mean of the underlying normal distribution.
  • sigma - standard deviation of the underlying normal distribution.
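
The two parameterizations are connected by the standard lognormal identities median = exp(mean) and average = exp(mean + sigma^2 / 2). The Python sketch below shows one consistent way to derive mean and sigma from average and median_ratio, and how the optional max cap could be applied; it illustrates the math and is not necessarily the script's exact implementation:

import numpy as np

def lognormal_params(average: float, median_ratio: float) -> tuple[float, float]:
    # median = average * median_ratio, and for a lognormal distribution:
    #   median  = exp(mean)              =>  mean  = ln(average * median_ratio)
    #   average = exp(mean + sigma**2/2) =>  sigma = sqrt(-2 * ln(median_ratio))
    assert 0.0 < median_ratio < 1.0, "median_ratio must be in (0, 1)"
    mean = np.log(average * median_ratio)
    sigma = np.sqrt(-2.0 * np.log(median_ratio))
    return mean, sigma

rng = np.random.default_rng(seed=0)
mean, sigma = lognormal_params(average=1000, median_ratio=0.85)
samples = np.minimum(rng.lognormal(mean, sigma, 100_000), 5000)  # the "max" cap
print(round(samples.mean()))  # close to 1000 (slightly lower due to the cap)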

zipf

{
    "distribution": "zipf",
    "alpha": 1.2,
    "max": 100
}
  • alpha - skew parameter (> 1). Larger values produce stronger skew toward smaller integers.

poisson

{
    "distribution": "poisson",
    "alpha": 10,
    "max": 50
}
  • alpha - expected value (λ). Also the variance of the distribution.
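
For reference, this is how sampling with the max cap could look for these two distributions using numpy (an illustration, not the script's actual code):

import numpy as np

rng = np.random.default_rng(seed=0)

# zipf: samples integers k >= 1 with P(k) proportional to k**(-alpha)
zipf_value = min(int(rng.zipf(a=1.2)), 100)        # "max": 100 caps the heavy tail

# poisson: alpha maps to lam, the expected value (and variance)
poisson_value = min(int(rng.poisson(lam=10)), 50)  # "max": 50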

ShareGPT Conversations

To run with the ShareGPT data, download the following ShareGPT dataset: https://huggingface.co/datasets/philschmid/sharegpt-raw/blob/main/sharegpt_20230401_clean_lang_split.json

Use the convert_sharegpt_to_openai.py script to convert the dataset to the format supported by benchmark_serving_multi_turn.py:

python convert_sharegpt_to_openai.py sharegpt_20230401_clean_lang_split.json sharegpt_conv_128.json --seed=99 --max-items=128

The script will convert the ShareGPT dataset to a dataset with the standard user/assistant roles.

The flag --max-items=128 is used to sample 128 conversations from the original dataset (change as needed).

Use the output JSON file sharegpt_conv_128.json as the --input-file for benchmark_serving_multi_turn.py.
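
To sanity-check the converted file, you can load it and inspect the first conversation. The schema below (a top-level list of conversations, each with a "messages" field) is an assumption for illustration; inspect your output file to confirm:

import json

with open("sharegpt_conv_128.json") as f:
    conversations = json.load(f)

print(len(conversations))    # expect 128, matching --max-items=128
# Assumed schema: OpenAI-style chat messages with user/assistant roles
print(conversations[0]["messages"][0])  # e.g. {"role": "user", "content": "..."}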