---
title: Offline Inference
---
You can run vLLM in your own code on a list of prompts.
The offline API is based on the [LLM][vllm.LLM] class.
To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.
For example, the following code downloads the `facebook/opt-125m` model from HuggingFace
and runs it in vLLM using the default configuration.
```python
from vllm import LLM

llm = LLM(model="facebook/opt-125m")
```
After initializing the LLM instance, you can perform model inference using various APIs.
The available APIs depend on the type of model that is being run:
- [Generative models][generative-models] output logprobs, which are sampled from to obtain the final output text (see the example below).
- [Pooling models][pooling-models] output their hidden states directly.
Please refer to the above pages for more details about each API.
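For example, a generative model can be run by passing a list of prompts and a `SamplingParams` object to `LLM.generate`. The snippet below is a minimal sketch using the same `facebook/opt-125m` model as above; the prompts and sampling settings are only illustrative.

```python
from vllm import LLM, SamplingParams

# Illustrative prompts and sampling settings.
prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")

# Generate one completion per prompt.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```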
!!! info
    [API Reference][offline-inference-api]