Minimal Docker Compose:
services:
  llamacpp:
    image: ghcr.io/allenporter/llama-cpp-server-cuda
    volumes:
      - ./models/:/data/models:rw
      - ./config/config.json:/data/config.json
    environment:
      - CONFIG_FILE=/data/config.json
    ports:
      - 8000:8000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
Minimal Config File. Full documentation for the config JSON format is available here.
{
  "host": "0.0.0.0",
  "port": 8000,
  "models": [
    {
      "model": "/data/models/Meta-Llama-3-8B-Instruct.Q4_K_S.gguf",
      "model_alias": "llama3-8b-8192",
      "chat_format": "chatml",
      "n_gpu_layers": -1,
      "offload_kqv": true,
      "n_threads": 12,
      "n_batch": 512,
      "n_ctx": 8192
    }
  ]
}
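Once the container is up, you can sanity-check the endpoint before wiring up LangChain. Here is a minimal sketch, assuming the server is reachable on localhost:8000 and exposes the standard OpenAI-compatible model listing route:

import requests

# List the models the server is serving; the model_alias from config.json
# should appear in the response (assumes no auth is enabled).
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])  # expect: llama3-8b-8192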
Inference with LangChain
The OpenAI client library requires an API key to be set, so even if you don’t have auth enabled on your endpoint, just provide a garbage value in .env:
OPENAI_API_KEY=blah
Then in your Python app:
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

# Load OPENAI_API_KEY (the placeholder value) from .env
load_dotenv()

# Point the client at the local server and reference the model_alias from config.json
chatbot = ChatOpenAI(base_url="http://localhost:8000/v1/", model="llama3-8b-8192")
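From there, the chat model behaves like any other LangChain chat model. A short usage sketch (the prompt is only an illustration):

# Send a single prompt and print the model's reply
reply = chatbot.invoke("Explain what a GGUF file is in one sentence.")
print(reply.content)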