
Ollama is a local LLM runtime that runs quantized models on Nvidia GPUs, Apple Silicon, and Intel CPUs.

Ollama has a really neat Docker-inspired model build system that allows you to run any existing GGUF model in its runtime and also pre-seed system prompts.

Ollama can be used easily with Open WebUI.

Configuration on Linux (systemd)

As per the FAQ doc, run systemctl edit ollama.service and add environment variables like below. Each environment variable is added as a separate Environment declaration:

[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_MAX_PARALLEL=2"

Parallel/Dynamic Batching

Ollama now supports dynamic batching and multi-threading, as per this ticket. However, it is still not really meant for production use with multiple end users; a better option may be software like llama-cpp-server or vLLM.
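
For reference, the relevant knobs are plain environment variables set in the same way as the systemd override above (the values here are illustrative, not recommendations):

[Service]
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=2"
Environment="OLLAMA_MAX_QUEUE=128"

OLLAMA_NUM_PARALLEL sets concurrent requests per loaded model, OLLAMA_MAX_LOADED_MODELS caps how many models stay resident at once, and OLLAMA_MAX_QUEUE bounds how many requests can wait in the queue.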

Ollama Modelfile

Modelfiles allow you to define models that can be run in Ollama and imported from GGUF format. Default values for model parameters and options can also be set, alongside things like prompt templates and stop words. Parameters that can take multiple values, like the stop parameter, may be defined multiple times:

PARAMETER stop "<|endoftext|>"
PARAMETER stop "<|end_of_turn|>"
PARAMETER stop "Human:"
PARAMETER stop "Assistant:"

Function Calling

Ollama now supports function calling using the OpenAI-compatible tool specification. The LangChain docs on structured outputs can be used to drive this.

See also: ollama function calling
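
As a rough sketch (not lifted from the Ollama docs), the OpenAI-compatible endpoint can be exercised with the stock openai Python client; the tool definition and model name below are placeholders and assume a default local install:

# pip install openai -- Ollama serves an OpenAI-compatible API on its default port
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# One illustrative tool definition in the OpenAI function-calling schema
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="llama3.1",  # any tool-capable model pulled into Ollama
    messages=[{"role": "user", "content": "What is the weather in Berlin?"}],
    tools=tools,
)

# If the model chose to call the tool, the arguments arrive as a JSON string
print(response.choices[0].message.tool_calls)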

DEPRECATED: Fixed Grammar and Ollama

See: Function Calling above

Constrained generation is currently not officially supported, but there are a number of PRs open. I arbitrarily plumped for this issue as an entry point. The linked repo can be downloaded and manually compiled, but it would be great if this was upstreamed.

Grammars are defined in llama.cpp's GBNF format but can be generated from Pydantic objects or a JSON schema. A JSON schema can be passed directly instead of a grammar.

A default grammar can be defined in the Modelfile, but a grammar can also be sent as part of a POST request at runtime:

{
  "model":"llama3:8b",
  "stream": false,
  "options":{"grammar":"root ::= (\"yes\" | \"no\")"},
  "messages":[
    {"role":"user", "content":"The boy rode the bike. Is there a girl in this story?"}
  ]
}
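
For completeness, the same payload can be sent to the /api/chat endpoint from Python, e.g. with requests; this assumes a build patched with the grammar support discussed above, since options.grammar is not honoured by stock Ollama:

import requests

payload = {
    "model": "llama3:8b",
    "stream": False,
    # "grammar" only works with the patched build discussed above
    "options": {"grammar": 'root ::= ("yes" | "no")'},
    "messages": [
        {"role": "user",
         "content": "The boy rode the bike. Is there a girl in this story?"},
    ],
}

r = requests.post("http://localhost:11434/api/chat", json=payload, timeout=60)
print(r.json()["message"]["content"])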

Conversion of QLoRA to Ollama

If you have trained a QLoRA model with axolotl and need to export it to Ollama:

  1. Convert the LoRA adapter to GGML using the script provided in llama.cpp:
python convert-lora-to-ggml.py ../axolotl/qlora-out
  2. Export the parent model or make sure a pre-prepared GGUF version of the parent model is available on the system.

  3. Use the export-lora command from llama.cpp to merge the LoRA in (NB: you can merge multiple LoRAs into a parent model at the same time):

./export-lora -m <parent_model.gguf> -o <output_name_something_something_lora.gguf> -l <path/to/lora.gguf>
  4. Use a Modelfile to import the merged model into Ollama:
FROM output_name_something_something_lora.gguf
...

and

ollama create -f Modelfile my_model_name
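
Once created, the merged model behaves like any other local model, so a quick smoke test is just:

ollama list
ollama run my_model_name "Say hello in one word."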