Ollama is a local LLM runtime that runs quantized models on NVIDIA GPUs, Apple Silicon and Intel CPUs.
Ollama has a really neat Docker-inspired model build system that allows you to run any existing GGUF model in its runtime and also pre-seed system prompts.
Ollama can be used easily with Open WebUI.
Configuration on Linux (systemd)
As per the FAQ doc, use:
systemctl edit ollama.service
and add environment variables as shown below. Each environment variable is added as a separate Environment declaration:
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_MAX_PARALLEL=2"
Parallel/Dynamic Batching
Ollama now supports dynamic batching and multi-threading as per this ticket. However, it is still not really intended for production use with multiple concurrent end users; a better option may be software like llama-cpp-server or vLLM.
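As a quick illustration, you can fire concurrent requests at the API and let Ollama handle them in parallel. A minimal Python sketch, assuming the server has OLLAMA_NUM_PARALLEL set as above (model name and prompts are illustrative):
# Minimal sketch: two concurrent requests against Ollama's /api/generate
# endpoint; the server processes them in parallel when OLLAMA_NUM_PARALLEL >= 2.
from concurrent.futures import ThreadPoolExecutor
import requests

def generate(prompt: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mistral", "prompt": prompt, "stream": False},
    )
    resp.raise_for_status()
    return resp.json()["response"]

prompts = ["Summarise the plot of Dune.", "Explain what a Modelfile is."]
with ThreadPoolExecutor(max_workers=2) as pool:
    for answer in pool.map(generate, prompts):
        print(answer)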
Ollama Modelfile
Modelfiles allow you to define models that can be run in Ollama and imported from GGUF format. Default values for model parameters and options can also be set, alongside things like prompt templates for interacting with models and stop words. Parameters that can have multiple values may be defined multiple times, like the stop parameter:
PARAMETER stop "<|endoftext|>"
PARAMETER stop "<|end_of_turn|>"
PARAMETER stop "Human:"
PARAMETER stop "Assistant:"
Function Calling
Ollama now supports function calling using OpenAI-compatible specifications. The LangChain docs on structured outputs can be used to drive this:
See also: ollama function calling
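A minimal sketch of the LangChain route, assuming the langchain-ollama package is installed and a tool-calling capable model (e.g. llama3.1) has been pulled locally; the schema and prompt are illustrative:
# Minimal sketch: structured output / function calling through LangChain + Ollama.
from pydantic import BaseModel, Field
from langchain_ollama import ChatOllama

class CityInfo(BaseModel):
    """Schema the model is asked to fill in."""
    name: str = Field(description="Name of the city")
    country: str = Field(description="Country the city is in")
    population: int = Field(description="Approximate population")

llm = ChatOllama(model="llama3.1", temperature=0)
structured_llm = llm.with_structured_output(CityInfo)

result = structured_llm.invoke("Tell me about Toulouse.")
print(result)  # CityInfo(name='Toulouse', country='France', population=...)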
DEPRECATED: Fixed Grammar and Ollama
See: Function Calling above
Constrained generation is currently not officially supported, but there are a number of PRs open. I arbitrarily plumped for this issue as an entrypoint. The linked repo can be downloaded and manually compiled, but it would be great if this were upstreamed.
Grammars are defined in llama.cpp's GBNF format but can be generated from Pydantic objects or a JSON Schema. A JSON Schema can also be passed directly instead of a grammar.
A default grammar can be defined in the Modelfile, but a grammar can also be sent as part of a POST request at runtime:
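For illustration only, a runtime request against one of the grammar-enabled forks might look roughly like this; the grammar field name is an assumption based on the linked PR and is not part of the official Ollama API:
# Illustrative sketch only: the "grammar" field is NOT part of the official
# Ollama API; it reflects the unofficial grammar-enabled forks and the exact
# field name may differ there.
import requests

YES_NO_GRAMMAR = r'root ::= "yes" | "no"'

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "mistral",
        "prompt": "Is the sky blue? Answer yes or no.",
        "grammar": YES_NO_GRAMMAR,  # assumed field name from the fork
        "stream": False,
    },
)
print(resp.json()["response"])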
Conversion of QLoRA to Ollama
If you have trained a QLoRA model with Axolotl and you need to export it to Ollama:
- Convert the LoRA adapter to GGML using the conversion script provided in llama.cpp (see the command sketch after this section).
- Export the parent model, or make sure a pre-prepared GGML/GGUF version of the parent model is available on the system.
- Use the export-lora command from llama.cpp to merge the LoRA into the parent model (nb: you can merge multiple LoRAs into a parent model at the same time).
- Use a Modelfile to import the merged model into Ollama:
FROM output_name_something_something_lora.gguf
...
and then run ollama create <model-name> -f Modelfile to register it with Ollama.
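A rough command sketch of the steps above, assuming a recent llama.cpp checkout; script names and flags have changed between llama.cpp versions, so check --help for your build (paths and model names here are illustrative):
# 1. Convert the trained (Q)LoRA adapter to a llama.cpp-compatible file.
#    Newer llama.cpp ships convert_lora_to_gguf.py; older versions used
#    convert-lora-to-ggml.py. Flags are indicative -- check --help.
python convert_lora_to_gguf.py --base ./base-model-hf ./qlora-output --outfile lora-adapter.gguf

# 2. Merge the adapter into the GGUF parent model with export-lora
#    (called llama-export-lora in newer builds); --lora can be repeated
#    to merge several adapters at once.
./llama-export-lora -m base-model.Q4_K_M.gguf --lora lora-adapter.gguf -o merged-model.gguf

# 3. Import into Ollama via the Modelfile shown above.
ollama create my-qlora-model -f Modelfile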