vLLM is a high-performance LLM serving middleware that is optimised for many concurrent users (as opposed to Ollama, which is optimised for a single-user/desktop experience).
Minimal Docker Compose File
This Docker Compose file runs a GPTQ-quantized version of Llama 3.1 with a volume mount for the Hugging Face cache (so that the container doesn’t re-download the full model every time it is restarted).
services:
 
  vllm:
    image: vllm/vllm-openai:latest
    command: --host 0.0.0.0  --model hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4
    ports:
      - 8003:8000
    #environment:
    volumes:
      - "/home/<user>/.cache/huggingface:/root/.cache/huggingface"
 
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
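Once the stack is up, the OpenAI-compatible API is reachable on the mapped host port (8003 in the file above), so a quick smoke test might look like this:
# bring the stack up and check that the model is registered
docker compose up -d
curl http://localhost:8003/v1/models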
Multi LoRA Support
vLLM can serve multiple LoRA adapters for the same base model, which means the base model’s memory footprint doesn’t need to be duplicated.
Benjamin Marie writes about this here (Paywalled, private wallabag link here)
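As a sketch of what this looks like on the command line (the adapter names and paths below are placeholders, not real checkpoints): adapters are registered at startup with --lora-modules and then selected per request by passing the adapter name as the model field of the OpenAI-compatible request.
# placeholder adapter names/paths; real adapters would be local dirs or HF repos
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules my-adapter-a=/path/to/adapter-a my-adapter-b=/path/to/adapter-b
# a request then selects an adapter by name via the "model" field
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-adapter-a", "prompt": "Hello", "max_tokens": 32}'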
Tool Calling
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json
For the Qwen2.5 family of models, as per the docs:
vllm serve Qwen/Qwen2.5-7B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
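With either of the servers above running, tools are passed through the standard OpenAI-compatible chat completions endpoint. A minimal request might look like this (the get_weather function and its schema are invented for illustration):
# the get_weather tool is made up; vLLM returns tool_calls in the response when the model decides to call it
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "What is the weather like in Berlin?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'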
ValueError: Cannot find the config file for awq.
See here - vLLM supports AWQ quantization, but models must be pre-quantized using a tool like AutoAWQ. This error occurs because vLLM cannot find the quantization config file for the AWQ model it is being asked to load.
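In practice this means pointing vLLM at a checkpoint that was already exported with AWQ weights rather than at the original full-precision repository. Something along these lines, where the model name is only an example of a pre-quantized AWQ checkpoint:
# serve a checkpoint that was quantized ahead of time (e.g. with AutoAWQ)
vllm serve hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --quantization awq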
Build for Old GPUs
vLLM does not support Pascal-generation GPUs out of the box. These GPUs use CUDA compute capability 6.1. We can add this architecture to the Docker build ourselves by setting the list of CUDA architectures the image is compiled for.
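A sketch of such a build, assuming the vLLM Dockerfile still exposes the torch_cuda_arch_list build argument (the Dockerfile location, argument name and default architecture list change between releases, so check the version being built):
git clone https://github.com/vllm-project/vllm.git
cd vllm
# add compute capability 6.1 (Pascal) to the architectures the image is compiled for
DOCKER_BUILDKIT=1 docker build . \
  --target vllm-openai \
  --build-arg torch_cuda_arch_list='6.1 7.0 7.5 8.0 8.6 8.9 9.0+PTX' \
  -t vllm-pascal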