# Offline Inference
Offline inference is possible in your own code using vLLM's [LLM][vllm.LLM] class.
For example, the following code downloads the facebook/opt-125m model from HuggingFace
and runs it in vLLM using the default configuration.
```python
from vllm import LLM

# Initialize the vLLM engine.
llm = LLM(model="facebook/opt-125m")
```
After initializing the `LLM` instance, use its APIs to perform model inference. Which APIs are available depends on the model type, as sketched below:

- Generative models output logprobs, which are sampled from to obtain the final output text.
- Pooling models output their hidden states directly.
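For example, here is a minimal sketch of the generative path using `LLM.generate` (the prompt and sampling settings are illustrative); pooling models instead expose dedicated entry points such as `LLM.encode`:

```python
from vllm import LLM, SamplingParams

# Initialize the engine with a generative model.
llm = LLM(model="facebook/opt-125m")

# Sample completions for a batch of prompts.
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["The capital of France is"], sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```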
!!! info "API Reference"

    See the [LLM][vllm.LLM] class documentation for the full set of available APIs.
## Ray Data LLM API
Ray Data LLM is an alternative offline inference API that uses vLLM as the underlying engine. This API adds several batteries-included capabilities that simplify large-scale, GPU-efficient inference:
- Streaming execution processes datasets that exceed aggregate cluster memory.
- Automatic sharding, load balancing, and autoscaling distribute work across a Ray cluster with built-in fault tolerance.
- Continuous batching keeps vLLM replicas saturated and maximizes GPU utilization.
- Transparent support for tensor and pipeline parallelism enables efficient multi-GPU inference.
- Reading from and writing to popular file formats and cloud object storage works out of the box.
- Workloads scale up without code changes.
??? code

    ```python
    import ray  # Requires ray>=2.44.1
    from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor

    # Configure the underlying vLLM engine.
    config = vLLMEngineProcessorConfig(model_source="unsloth/Llama-3.2-1B-Instruct")

    # Build a processor that turns dataset rows into chat requests
    # and extracts the generated text from the responses.
    processor = build_llm_processor(
        config,
        preprocess=lambda row: {
            "messages": [
                {"role": "system", "content": "You are a bot that completes unfinished haikus."},
                {"role": "user", "content": row["item"]},
            ],
            "sampling_params": {"temperature": 0.3, "max_tokens": 250},
        },
        postprocess=lambda row: {"answer": row["generated_text"]},
    )

    ds = ray.data.from_items(["An old silent pond..."])
    ds = processor(ds)
    ds.write_parquet("local:///tmp/data/")
    ```
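To inspect the results, the Parquet output can be read back with Ray Data; a short sketch using the same output path as in the example above:

```python
import ray

# Read the generated answers written by the pipeline above.
results = ray.data.read_parquet("local:///tmp/data/")
print(results.take(1))  # e.g. [{"answer": "..."}]
```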
For more information about the Ray Data LLM API, see the [Ray Data LLM documentation](https://docs.ray.io/en/latest/data/working-with-llms.html).