Benchmark KV Cache Offloading with Multi-Turn Conversations
The pip requirements for benchmark_serving_multi_turn.py can be found in requirements.txt.
First, start serving your model:
export MODEL_PATH=/models/meta-llama/Meta-Llama-3.1-8B-Instruct/
vllm serve $MODEL_PATH --served-model-name Llama --disable-log-requests
The MODEL_PATH variable should point to the model files (e.g., downloaded from Hugging Face).
Synthetic Multi-Turn Conversations
Download the following text file (used to generate the synthetic conversations):
wget https://www.gutenberg.org/ebooks/1184.txt.utf-8
mv 1184.txt.utf-8 pg1184.txt
The filename pg1184.txt is used in generate_multi_turn.json (see "text_files").
You may use other text files if you prefer; this specific file is not required.
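As a rough illustration of how such a text file can seed the synthetic prompts, one can draw random word spans from it so that each conversation gets distinct content. This is a sketch only: word counts merely approximate model tokens, and the actual generator may work differently.

import random

def random_text_span(path: str, num_words: int, rng: random.Random) -> str:
    # Pick a random contiguous span of words from the source text, so that
    # different conversations receive distinct content.
    words = open(path, encoding="utf-8").read().split()
    start = rng.randrange(len(words) - num_words)
    return " ".join(words[start:start + num_words])

print(random_text_span("pg1184.txt", 50, random.Random(0)))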
Then run the benchmarking script:
export MODEL_PATH=/models/meta-llama/Meta-Llama-3.1-8B-Instruct/
python benchmark_serving_multi_turn.py --model $MODEL_PATH --served-model-name Llama \
--input-file generate_multi_turn.json --num-clients 2 --max-active-conversations 6
You can edit the file generate_multi_turn.json to change the conversation parameters (number of turns, etc.).
If successful, you will see output like the following:
----------------------------------------------------------------------------------------------------
Statistics summary:
runtime_sec = 215.810
requests_per_sec = 0.769
----------------------------------------------------------------------------------------------------
count mean std min 25% 50% 75% 90% 99% max
ttft_ms 166.0 78.22 67.63 45.91 59.94 62.26 64.43 69.66 353.18 567.54
tpot_ms 166.0 25.37 0.57 24.40 25.07 25.31 25.50 25.84 27.50 28.05
latency_ms 166.0 2591.07 326.90 1998.53 2341.62 2573.01 2860.10 3003.50 3268.46 3862.94
input_num_turns 166.0 7.43 4.57 1.00 3.00 7.00 11.00 13.00 17.00 17.00
input_num_tokens 166.0 2006.20 893.56 522.00 1247.75 2019.00 2718.00 3233.00 3736.45 3899.00
output_num_tokens 166.0 100.01 11.80 80.00 91.00 99.00 109.75 116.00 120.00 120.00
output_num_chunks 166.0 99.01 11.80 79.00 90.00 98.00 108.75 115.00 119.00 119.00
----------------------------------------------------------------------------------------------------
If you run with --warmup-step, the summary will also include warmup_runtime_sec
and total_runtime_incl_warmup_sec (while runtime_sec continues to reflect the
benchmark-only runtime so the reported throughput stays comparable).
JSON configuration file for synthetic conversation generation
The --input-file flag determines the input conversations for the benchmark.
When the input is a JSON file with the field "filetype": "generate_conversations", the tool generates synthetic multi-turn (question and answer) conversations.
The file generate_multi_turn.json is an example file.
The file must contain the sections prompt_input and prompt_output.
The prompt_input section must contain num_turns, prefix_num_tokens, and num_tokens:
num_turns - Total number of turns in the conversation (both user and assistant). The final value is always rounded to an even number so each user turn has a reply.
prefix_num_tokens - Tokens added at the start of only the first user turn in a conversation (unique per conversation).
num_tokens - Total token length of each user message (one turn).
The prompt_output section must contain num_tokens:
num_tokens - Total token length of each assistant message (one turn).
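Putting these fields together, a minimal input file in the spirit of generate_multi_turn.json could look as follows (the values are illustrative; the distribution objects are explained in the next section):

{
    "filetype": "generate_conversations",
    "text_files": ["pg1184.txt"],
    "prompt_input": {
        "num_turns": {"distribution": "uniform", "min": 12, "max": 18},
        "prefix_num_tokens": {"distribution": "lognormal", "average": 1000, "max": 5000},
        "num_tokens": {"distribution": "uniform", "min": 120, "max": 160}
    },
    "prompt_output": {
        "num_tokens": {"distribution": "constant", "value": 100}
    }
}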
Random distributions for synthetic conversation generation
When creating an input JSON file (such as generate_multi_turn.json),
every numeric field (such as num_turns or num_tokens) requires a distribution.
The distribution determines how to randomly sample values for the field.
The available distributions are listed below.
Note: the optional max field (for lognormal, zipf, and poisson) caps sampled values at an upper bound.
This can be used, for example, to ensure that the total number of tokens in a request never exceeds --max-model-len (the sampling sketch after the distribution list shows one way to apply the cap).
constant
{
"distribution": "constant",
"value": 500
}
value - the fixed integer value (always returns the same number).
uniform
{
"distribution": "uniform",
"min": 12,
"max": 18
}
min - minimum value (inclusive).
max - maximum value (inclusive); must be greater than or equal to min.
lognormal
{
"distribution": "lognormal",
"average": 1000,
"max": 5000
}
You can parameterize the lognormal distribution in one of two ways:
Using average and the optional median_ratio:
average - target average value of the distribution.
median_ratio - ratio of the median to the average; controls the skewness. Must be in the range (0, 1).
Using the parameters of the underlying normal distribution:
mean - mean of the underlying normal distribution.
sigma - standard deviation of the underlying normal distribution.
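For intuition on the first form: a lognormal distribution built on an underlying normal N(mu, sigma^2) has median exp(mu) and mean exp(mu + sigma^2/2), so median_ratio = exp(-sigma^2/2). Inverting these relations gives a conversion along the following lines (a minimal sketch, not necessarily the script's exact implementation):

import math
import numpy as np

def lognormal_from_average(average: float, median_ratio: float,
                           rng: np.random.Generator, size: int) -> np.ndarray:
    # median/mean = exp(-sigma^2 / 2)  =>  sigma = sqrt(-2 * ln(median_ratio))
    sigma = math.sqrt(-2.0 * math.log(median_ratio))
    # mean = exp(mu + sigma^2 / 2)     =>  mu = ln(average) - sigma^2 / 2
    mu = math.log(average) - sigma ** 2 / 2.0
    return rng.lognormal(mean=mu, sigma=sigma, size=size)

# Example: average 1000 with median_ratio 0.85 (both values illustrative).
samples = lognormal_from_average(1000, 0.85, np.random.default_rng(0), 100_000)
print(samples.mean())  # close to 1000; a smaller median_ratio gives a heavier tail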
zipf
{
"distribution": "zipf",
"alpha": 1.2,
"max": 100
}
alpha - skew parameter (> 1). Larger values produce stronger skew toward smaller integers.
poisson
{
"distribution": "poisson",
"alpha": 10,
"max": 50
}
alpha - expected value (λ), which is also the variance of the distribution.
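The sketch below shows how values could be sampled from any of the specs above, including the optional max cap. The clipping semantics are an assumption here, and the actual script may differ in details:

import numpy as np

def sample(spec: dict, rng: np.random.Generator) -> int:
    # Sample one integer according to a distribution spec from the JSON file.
    kind = spec["distribution"]
    if kind == "constant":
        value = spec["value"]
    elif kind == "uniform":
        # +1 because both bounds are inclusive.
        value = rng.integers(spec["min"], spec["max"] + 1)
    elif kind == "lognormal":
        # For the average/median_ratio form, first convert to mean/sigma
        # as shown in the lognormal sketch above.
        value = rng.lognormal(mean=spec["mean"], sigma=spec["sigma"])
    elif kind == "zipf":
        value = rng.zipf(spec["alpha"])
    elif kind == "poisson":
        value = rng.poisson(spec["alpha"])
    else:
        raise ValueError(f"unknown distribution: {kind}")
    # Optional upper bound for lognormal/zipf/poisson, e.g. to keep requests
    # under --max-model-len.
    if kind in ("lognormal", "zipf", "poisson") and "max" in spec:
        value = min(value, spec["max"])
    return int(value)

rng = np.random.default_rng(0)
print(sample({"distribution": "zipf", "alpha": 1.2, "max": 100}, rng))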
ShareGPT Conversations
To run with the ShareGPT data, download the following ShareGPT dataset:
https://huggingface.co/datasets/philschmid/sharegpt-raw/blob/main/sharegpt_20230401_clean_lang_split.json
Use the convert_sharegpt_to_openai.py script to convert the dataset to a format supported by benchmark_serving_multi_turn.py:
python convert_sharegpt_to_openai.py sharegpt_20230401_clean_lang_split.json sharegpt_conv_128.json --seed=99 --max-items=128
The script converts the ShareGPT dataset into conversations with the standard user/assistant roles (see the example below).
The flag --max-items=128 is used to sample 128 conversations from the original dataset (change as needed).
Use the output JSON file sharegpt_conv_128.json as the --input-file for benchmark_serving_multi_turn.py.
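For reference, each converted conversation follows the standard OpenAI chat format with alternating user/assistant messages, along these lines (the top-level structure shown here is an assumption, and the content is illustrative):

{
    "messages": [
        {"role": "user", "content": "Hello, can you explain KV caching?"},
        {"role": "assistant", "content": "Sure! The KV cache stores attention keys and values..."},
        {"role": "user", "content": "How does offloading help?"}
    ]
}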