---
title: Offline Inference
---
You can run vLLM in your own code on a list of prompts.
The offline API is based on the [LLM][vllm.LLM] class.
To initialize the vLLM engine, create a new instance of `LLM` and specify the model to run.
For example, the following code downloads the `facebook/opt-125m` model from HuggingFace
and runs it in vLLM using the default configuration.
```python
from vllm import LLM

llm = LLM(model="facebook/opt-125m")
```
After initializing the LLM instance, you can perform model inference using various APIs.
The available APIs depend on the type of model that is being run:
- [Generative models][generative-models] output logprobs, which are sampled from to obtain the final output text (see the example below).
- [Pooling models][pooling-models] output their hidden states directly.
Please refer to the above pages for more details about each API.
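For example, a generative model can be run by passing a list of prompts and a `SamplingParams` object to `LLM.generate`. The snippet below is a minimal sketch using the same `facebook/opt-125m` model as above; the prompts and sampling settings are only illustrative.

```python
from vllm import LLM, SamplingParams

# Illustrative prompts and sampling settings.
prompts = [
    "Hello, my name is",
    "The capital of France is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="facebook/opt-125m")

# Generate one completion per prompt.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```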
!!! info
    [API Reference][offline-inference-api]