GitHub Repo

Minimal Docker Compose:

services:
  llamacpp:
    image: ghcr.io/allenporter/llama-cpp-server-cuda
    volumes:
      - ./models/:/data/models:rw
      - ./config/config.json:/data/config.json
    environment:
      - CONFIG_FILE=/data/config.json
    ports:
      - 8000:8000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Minimal Config File. Full documentation for the config JSON format is available here.

{
    "host": "0.0.0.0",
    "port": 8000,
    "models": [
        {
            "model": "/data/models/Meta-Llama-3-8B-Instruct.Q4_K_S.gguf",
            "model_alias": "llama3-8b-8192",
            "chat_format": "chatml",
            "n_gpu_layers": -1,
            "offload_kqv": true,
            "n_threads": 12,
            "n_batch": 512,
            "n_ctx": 8192
        }
    ]
}
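With both files in place, start the stack with docker compose up -d. As a quick sanity check (a sketch, assuming the server is reachable on localhost:8000 and the requests package is installed), the OpenAI-compatible /v1/models endpoint should list the alias defined above:

import requests

# The server speaks the OpenAI API; /v1/models should list the model_alias values from config.json
resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()
print(resp.json())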
 

Inference with LangChain

The OpenAI client library requires an API key to be set, so even if you don't have auth enabled on your endpoint, just provide a placeholder value in .env:

OPENAI_API_KEY=blah

Then, in your Python app:

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI

# Load OPENAI_API_KEY (the placeholder value) from .env
load_dotenv()

# Point the client at the local server; the model name must match a model_alias from config.json
chatbot = ChatOpenAI(base_url="http://localhost:8000/v1/", model="llama3-8b-8192")
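
From here, a minimal usage sketch (assuming the Compose stack above is running and serving the model on port 8000):

# invoke() sends a single chat completion request and returns an AIMessage
response = chatbot.invoke("Summarize what llama.cpp does in one sentence.")
print(response.content)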