vLLM is a high-performance LLM serving middleware that is optimised for many concurrent users (as opposed to Ollama, which is optimised for a single-user/desktop experience).
Minimal Docker Compose File
This Docker Compose file runs a GPTQ-quantized version of Llama 3.1 with a volume mount for the Hugging Face cache (so that the container doesn’t re-download the full model every time it is restarted).
services:
 
  vllm:
    image: vllm/vllm-openai:latest
    command: --host 0.0.0.0  --model hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4
    ports:
      - 8003:8000
    #environment:
    volumes:
      - "/home/<user>/.cache/huggingface:/root/.cache/huggingface"
 
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
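Once the stack is up, the OpenAI-compatible API is reachable on the mapped host port (8003 in the file above), so a quick smoke test might look like this:
# bring the stack up and check that the model is registered
docker compose up -d
curl http://localhost:8003/v1/models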
Multi LoRA Support
vLLM can serve multiple LoRA adapters for the same base model, which means the base model’s memory footprint doesn’t need to be duplicated.
Benjamin Marie writes about this here (Paywalled, private wallabag link here)
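As a sketch of what this looks like on the command line (the adapter names and paths below are placeholders, not real checkpoints): adapters are registered at startup with --lora-modules and then selected per request by passing the adapter name as the model field of the OpenAI-compatible request.
# placeholder adapter names/paths; real adapters would be local dirs or HF repos
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules my-adapter-a=/path/to/adapter-a my-adapter-b=/path/to/adapter-b
# a request then selects an adapter by name via the "model" field
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-adapter-a", "prompt": "Hello", "max_tokens": 32}'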
Tool Calling
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json
For the Qwen2.5 family of models, as per the docs:
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
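With either of the servers above running, tools are passed through the standard OpenAI-compatible chat completions endpoint. A minimal request might look like this (the get_weather function and its schema are invented for illustration):
# the get_weather tool is made up; vLLM returns tool_calls in the response when the model decides to call it
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the weather like in Berlin?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'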
ValueError: Cannot find the config file for awq.
See here - vLLM supports AWQ quantization, but models must be pre-quantized using a tool like AutoAWQ. This error occurs because vLLM cannot find the quantization config file for the AWQ model it is being asked to load.
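In practice this means pointing vLLM at a checkpoint that was already exported with AWQ weights rather than at the original full-precision repository. Something along these lines, where the model name is only an example of a pre-quantized AWQ checkpoint:
# serve a checkpoint that was quantized ahead of time (e.g. with AutoAWQ)
vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq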
Build for Old GPUs
vLLM does not support Pascal-generation GPUs out of the box. These GPUs use CUDA compute capability 6.1. We can add this architecture to the Docker build ourselves by setting the list of CUDA architectures the image is compiled for.
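A sketch of such a build, assuming the vLLM Dockerfile still exposes the torch_cuda_arch_list build argument (the Dockerfile location, argument name and default architecture list change between releases, so check the version being built):
git clone https://github.com/vllm-project/vllm.git
cd vllm
# add compute capability 6.1 (Pascal) to the architectures the image is compiled for
DOCKER_BUILDKIT=1 docker build . \
  --target vllm-openai \
  --build-arg torch_cuda_arch_list='6.1 7.0 7.5 8.0 8.6 8.9 9.0+PTX' \
  -t vllm-pascal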