vLLM is a high-performance LLM inference server optimised for serving many concurrent users (as opposed to Ollama, which is optimised for a single-user/desktop experience).
Minimal Docker Compose File
This docker compose file runs the GPTQ quantized version of Llama 3.1 with a volume mount for Hugging Face downloads (so that the container doesn’t try to fetch the full model every time it is restarted).
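A minimal sketch, assuming the official vllm/vllm-openai image and a GPTQ-quantised Llama 3.1 8B Instruct checkpoint (the model ID below is just an example; swap in whichever GPTQ repo you actually use):

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"              # OpenAI-compatible API on the host
    volumes:
      # Reuse Hugging Face downloads across container restarts
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      # Needed for gated repos such as the Llama models
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
    ipc: host                    # vLLM uses shared memory between workers
    # Arguments are appended to the image's API server entrypoint;
    # quantization is usually auto-detected, the flag just makes it explicit
    command: >
      --model hugging-quants/Meta-Llama-3.1-8B-Instruct-GPTQ-INT4
      --quantization gptq
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```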
Multi-LoRA Support
vLLM can serve multiple LoRA adapters on top of the same base model, which means the base model’s memory footprint doesn’t need to be duplicated for each adapter.
Benjamin Marie writes about this here (paywalled; private wallabag link here).
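As a rough sketch of how this looks in practice, the adapters are registered at startup with --enable-lora and --lora-modules. The base model ID, adapter names, and adapter paths below are placeholders; adapters can be local directories mounted into the container or Hugging Face repo IDs:

```yaml
    # Same compose service as above, with LoRA serving switched on
    command: >
      --model meta-llama/Llama-3.1-8B-Instruct
      --enable-lora
      --lora-modules sql-adapter=/adapters/sql chat-adapter=/adapters/chat
```

A request then selects an adapter by sending its registered name (e.g. sql-adapter) as the model field of the OpenAI-style API call; the base weights are loaded once and only the adapter is switched per request.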
Build for Old GPUs
vLLM does not support Pascal-generation GPUs out of the box. These GPUs are CUDA compute capability 6.1. We can add this ourselves by setting the CUDA architecture list when building the Docker image from source.
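A sketch of what that could look like as a compose build override, assuming the upstream Dockerfile exposes a torch_cuda_arch_list build argument (the argument name and the Dockerfile location have changed between vLLM releases, so check the repo before building):

```yaml
services:
  vllm:
    build:
      # Local checkout of https://github.com/vllm-project/vllm
      context: ./vllm
      args:
        # Compile the CUDA kernels for Pascal (compute capability 6.1).
        # Listing newer arches here as well keeps the image usable on other
        # cards, at the cost of a longer build.
        torch_cuda_arch_list: "6.1"
```

The same override should work with a plain docker build via --build-arg torch_cuda_arch_list="6.1".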