Github Repo

vLLM is high-performance LLM serving middleware optimised for many concurrent users (as opposed to Ollama, which is optimised for a single-user/desktop experience).

Minimal Docker Compose File

This Docker Compose file runs a GPTQ-quantised version of Llama 3.1 8B Instruct, with a volume mount for the Hugging Face cache (so that the container doesn't re-download the full model every time it is restarted).

services:
  vllm:
    image: vllm/vllm-openai:latest
    command: --host 0.0.0.0 --model hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4
    ports:
      - 8003:8000
    volumes:
      - "/home/<user>/.cache/huggingface:/root/.cache/huggingface"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
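
Once the container is up, vLLM exposes an OpenAI-compatible API on the mapped port (8003 above). As a quick smoke test, something like the following should return a completion; the prompt and max_tokens values are arbitrary:

curl http://localhost:8003/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4",
        "prompt": "Hello, my name is",
        "max_tokens": 32
      }'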

Multi-LoRA Support

vLLM can serve multiple LoRA adapters on top of the same base model, which means the base model's memory footprint doesn't have to be duplicated for each fine-tuned variant.

Benjamin Marie writes about this here (paywalled; private Wallabag link here).
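
As a rough sketch of how this could look with the compose file above: the OpenAI-compatible server takes --enable-lora and --lora-modules name=path flags (worth double-checking against the vLLM docs for your version), and the adapter name my-adapter plus the /adapters mount below are placeholders. The GPU reservation block is the same as above and omitted for brevity.

services:
  vllm:
    image: vllm/vllm-openai:latest
    # --enable-lora switches on adapter serving;
    # each --lora-modules entry maps an adapter name to a local adapter path
    command: >
      --host 0.0.0.0
      --model hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4
      --enable-lora
      --lora-modules my-adapter=/adapters/my-adapter
    ports:
      - 8003:8000
    volumes:
      - "/home/<user>/.cache/huggingface:/root/.cache/huggingface"
      - "/home/<user>/adapters:/adapters"

Requests then select an adapter by passing its name (my-adapter here) in the model field instead of the base model name.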

Build for Old GPUs

vLLM does not support Pascal-generation GPUs out of the box. These GPUs use CUDA compute capability 6.1. We can manually add this to the Docker build by setting