[Docs] Enable some more markdown lint rules for the docs (#28731)
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
@@ -6,9 +6,6 @@ MD024:
MD031:
  list_items: false
MD033: false
MD045: false
MD046: false
MD051: false
MD052: false
MD053: false
MD059: false

@@ -10,8 +10,6 @@ vLLM provides comprehensive benchmarking tools for performance testing and evalu
- **[Parameter sweeps](#parameter-sweeps)**: Automate `vllm bench` runs for multiple configurations
- **[Performance benchmarks](#performance-benchmarks)**: Automated CI benchmarks for development

-[Benchmark CLI]: #benchmark-cli
-
## Benchmark CLI

This section guides you through running benchmark tests with the extensive

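For context on what the "parameter sweeps" entry above automates, here is a minimal sketch that loops a grid of configurations over the `vllm bench latency` subcommand. The flag names are assumptions, not taken from this commit; check `vllm bench latency --help` on your version before relying on them.

```python
# Hypothetical parameter sweep over `vllm bench latency` runs.
# Assumes vLLM is installed and that the subcommand accepts
# --input-len / --output-len (verify with `vllm bench latency --help`).
import itertools
import subprocess

MODEL = "Qwen/Qwen1.5-0.5B-Chat"

for input_len, output_len in itertools.product([128, 512], [128, 256]):
    cmd = [
        "vllm", "bench", "latency",
        "--model", MODEL,
        "--input-len", str(input_len),
        "--output-len", str(output_len),
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # one benchmark run per configuration
```
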
@@ -95,7 +95,7 @@ when manually triggering a build on Buildkite. This branch accomplishes two thin
to warm it up so that future builds are faster.

<p align="center" width="100%">
-    <img width="60%" src="https://github.com/user-attachments/assets/a8ff0fcd-76e0-4e91-b72f-014e3fdb6b94">
+    <img width="60%" alt="Buildkite new build popup" src="https://github.com/user-attachments/assets/a8ff0fcd-76e0-4e91-b72f-014e3fdb6b94">
</p>

## Update dependencies

@@ -29,8 +29,8 @@ pip install vllm
- API Path: `/chat/completions`
- Model: `qwen/Qwen1.5-0.5B-Chat`

-
+

1. Go to `Just chat`, and start to chat:

-
+

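As a quick sanity check of the `/chat/completions` endpoint Dify is being pointed at in the hunk above, here is a minimal sketch using the `openai` client. It assumes a vLLM server is already running locally on port 8000 serving this model; adjust `base_url` for your deployment.

```python
# Verify the OpenAI-compatible endpoint outside Dify.
# Assumes `vllm serve qwen/Qwen1.5-0.5B-Chat` is running on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="qwen/Qwen1.5-0.5B-Chat",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```
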
@@ -46,12 +46,12 @@ And install [Docker](https://docs.docker.com/engine/install/) and [Docker Compos
- **Model Name for API Endpoint**: `Qwen/Qwen1.5-7B-Chat`
- **Completion Mode**: `Completion`

-
+

1. To create a test chatbot, go to `Studio → Chatbot → Create from Blank`, then select Chatbot as the type:

-
+

1. Click the chatbot you just created to open the chat interface and start interacting with the model:

-
+

@@ -19,9 +19,9 @@ The input activation format completely depends on the All2All Dispatch being use

The FusedMoE operation is generally made of multiple operations, in both the Contiguous and Batched variants, as described in the diagrams below

-
+

-
+

!!! note
    The main difference, in terms of operations, between the Batched and Non-Batched cases is the Permute / Unpermute operations. All other operations remain.
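
To make the operation sequence those diagrams describe concrete, here is a self-contained toy (single rank, no quantization or All2All, and plain loops instead of fused kernels) of the contiguous variant's Permute → expert GEMMs → Unpermute → weight-apply/reduce flow. It is purely illustrative, not vLLM's implementation.

```python
# Toy contiguous-format MoE: permute tokens by expert, run per-expert GEMMs,
# unpermute, then apply TopK weights and reduce. Illustrative only.
import torch

num_tokens, hidden, inter, num_experts, topk = 8, 16, 32, 4, 2
x = torch.randn(num_tokens, hidden)
w1 = torch.randn(num_experts, inter, hidden)
w2 = torch.randn(num_experts, hidden, inter)
topk_weights = torch.softmax(torch.randn(num_tokens, topk), dim=-1)
topk_ids = torch.randint(0, num_experts, (num_tokens, topk))

# Permute: replicate each token once per selected expert and sort by expert
# id so each expert's tokens are contiguous.
flat_ids = topk_ids.flatten()                      # (num_tokens * topk,)
sort_order = torch.argsort(flat_ids)
x_rep = x.repeat_interleave(topk, dim=0)[sort_order]

# Expert computation: two GEMMs with a nonlinearity in between.
out = torch.empty_like(x_rep)
for e in range(num_experts):
    mask = flat_ids[sort_order] == e
    out[mask] = torch.relu(x_rep[mask] @ w1[e].T) @ w2[e].T

# Unpermute: restore the original (token, topk-slot) order.
inv = torch.argsort(sort_order)
out = out[inv].view(num_tokens, topk, hidden)

# TopK weight application and reduction.
result = (out * topk_weights.unsqueeze(-1)).sum(dim=1)
print(result.shape)  # torch.Size([8, 16])
```
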
@@ -57,7 +57,7 @@ The `FusedMoEModularKernel` acts as a bridge between the `FusedMoEPermuteExperts
The `FusedMoEPrepareAndFinalize` abstract class exposes `prepare`, `prepare_no_receive` and `finalize` functions.
The `prepare` function is responsible for input activation Quantization and All2All Dispatch. If implemented, `prepare_no_receive` is like `prepare` except that it does not wait to receive results from other workers. Instead it returns a "receiver" callback that must be invoked to wait for the final results of the worker. Not all `FusedMoEPrepareAndFinalize` classes are required to support this method, but when it is available it can be used to interleave work with the initial all-to-all communication, e.g. interleaving shared experts with fused experts. The `finalize` function is responsible for invoking the All2All Combine. Additionally, the `finalize` function may or may not do the TopK weight application and reduction (please refer to the TopKWeightAndReduce section).


|
||||

|
||||
|
||||
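A minimal sketch of that contract, as a toy single-rank stand-in. Method names follow the description above; the bodies and signatures are simplified illustrations, not vLLM's actual API.

```python
# Toy single-rank stand-in for a FusedMoEPrepareAndFinalize implementation.
import torch

class SingleRankPrepareAndFinalize:
    def prepare(self, hidden_states: torch.Tensor):
        # Real implementations quantize the activations and do a blocking
        # All2All Dispatch; with a single rank both are trivial.
        return self._quantize(hidden_states)

    def prepare_no_receive(self, hidden_states: torch.Tensor):
        # Kick off the dispatch without waiting and return a "receiver"
        # callback. The caller can run other work (e.g. shared experts)
        # before invoking it to collect the dispatched activations.
        a_q = self._quantize(hidden_states)
        return lambda: a_q

    def finalize(self, expert_output: torch.Tensor,
                 topk_weights: torch.Tensor) -> torch.Tensor:
        # Real implementations invoke the All2All Combine here and may also
        # apply the TopK weights and reduce (see the TopKWeightAndReduce
        # section); this toy only does the weight application + reduction.
        # expert_output: (num_tokens, topk, hidden);
        # topk_weights: (num_tokens, topk)
        return (expert_output * topk_weights.unsqueeze(-1)).sum(dim=1)

    @staticmethod
    def _quantize(x: torch.Tensor) -> torch.Tensor:
        return x  # no-op quantization in this illustration
```
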
### FusedMoEPermuteExpertsUnpermute

@@ -88,7 +88,7 @@ The core FusedMoE implementation performs a series of operations. It would be in
It is sometimes efficient to perform TopK weight application and Reduction inside the `FusedMoEPermuteExpertsUnpermute::apply()`. Find an example [here](https://github.com/vllm-project/vllm/pull/20228). We have a `TopKWeightAndReduce` abstract class to facilitate such implementations. Please refer to the TopKWeightAndReduce section.
`FusedMoEPermuteExpertsUnpermute::finalize_weight_and_reduce_impl()` returns the `TopKWeightAndReduce` object that the implementation wants the `FusedMoEPrepareAndFinalize::finalize()` to use.


|
||||

|
||||
|
||||
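A minimal sketch of that hand-off: the experts implementation picks, via `finalize_weight_and_reduce_impl()`, which strategy object `finalize()` should apply. The two strategy classes and the boolean switch below are illustrative placeholders, not vLLM's actual class names.

```python
# Illustration of the TopKWeightAndReduce hand-off described above.
import torch

class WeightAndReduce:
    """finalize() applies TopK weights and reduces over the topk dim."""

    def apply(self, out: torch.Tensor, topk_weights: torch.Tensor) -> torch.Tensor:
        # out: (num_tokens, topk, hidden); topk_weights: (num_tokens, topk)
        return (out * topk_weights.unsqueeze(-1)).sum(dim=1)

class WeightAndReduceNoOp:
    """Used when apply() already did the weight application and reduction."""

    def apply(self, out: torch.Tensor, topk_weights: torch.Tensor) -> torch.Tensor:
        return out

class ToyExperts:
    did_weight_and_reduce_in_apply = True  # this kernel fuses the step

    def finalize_weight_and_reduce_impl(self):
        # Tell the FusedMoEPrepareAndFinalize's finalize() what is left to do.
        if self.did_weight_and_reduce_in_apply:
            return WeightAndReduceNoOp()
        return WeightAndReduce()
```
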
### FusedMoEModularKernel