LLM Utility

I’m a big fan of Simon Willison’s llm package. It works nicely with llama.cpp via the llm-llama-cpp plugin.

Installing llm

I didn’t get on well with pipx for this, so I used conda to create a virtual environment for llm and installed it in there.

Since I have an NVIDIA card, I pass in CMake flags so that llama-cpp-python is built with CUDA support:

conda create -y -n llm python=3.10
conda activate llm
pip install llm llm-llama-cpp
CMAKE_ARGS="-DLLAMA_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc" FORCE_CMAKE=1 llm install llama-cpp-python
 
# Alternatively, if no NVIDIA GPU is available, building with OpenBLAS works well:
# CMAKE_ARGS="-DLLAMA_OPENBLAS=on" FORCE_CMAKE=1 llm install llama-cpp-python
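
After installation, a quick sanity check is to list the installed plugins; the llm-llama-cpp plugin should appear in the output (exact output varies by version):

llm plugins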

Installing a Model

The llm utility can download GGUF-format models directly from Hugging Face:

llm llama-cpp download-model https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF/resolve/main/mistral-7b-v0.1.Q5_K_M.gguf?download=true
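
Once the download completes, the model should show up in the output of llm models, and you can give it a shorter alias if the full ID is unwieldy (the model ID below matches the file downloaded above, but may differ on your machine):

llm models
llm aliases set mistral mistral-7b-v0.1.Q5_K_M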

Running Model with GPU Offload

Use the n_gpu_layers option to offload all of the model’s layers to the GPU if you have enough VRAM; this speeds up generation significantly.

llm chat -m mistral-7b-v0.1.Q5_K_M -o n_gpu_layers 64
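
The same option works for one-off prompts as well as interactive chat, for example:

llm -m mistral-7b-v0.1.Q5_K_M -o n_gpu_layers 64 "Summarise the GGUF format in two sentences"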

LangChain

LangChain is a FOSS library for chaining together prompt-able language models. I’ve been using it for building all sorts of cool stuff.
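
As a minimal sketch of how this ties in with the local model above (this assumes LangChain’s community LlamaCpp wrapper; the model path is a placeholder you will need to adjust to wherever the GGUF file was downloaded):

from langchain_community.llms import LlamaCpp
from langchain_core.prompts import PromptTemplate

# Point the wrapper at the local GGUF file; the path is a placeholder.
local_llm = LlamaCpp(
    model_path="/path/to/mistral-7b-v0.1.Q5_K_M.gguf",
    n_gpu_layers=64,   # offload layers to the GPU, as with the llm CLI above
    n_ctx=4096,        # context window size
)

prompt = PromptTemplate.from_template("Explain {topic} in one short paragraph.")
chain = prompt | local_llm   # LCEL-style chain: prompt -> model
print(chain.invoke({"topic": "quantised GGUF models"}))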

Structured Outputs

There are a number of ways that we can coerce LLMs into providing structured responses, some of which are described here. My favourites are:
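
As a general illustration of the idea (not necessarily one of those favourites), one common technique is LangChain’s PydanticOutputParser, which injects format instructions into the prompt and validates the model’s reply against a Pydantic schema. The sketch below reuses the placeholder model path from the LangChain example above:

from langchain_community.llms import LlamaCpp
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from pydantic import BaseModel, Field

# The schema the model's reply must conform to.
class Film(BaseModel):
    title: str = Field(description="title of the film")
    year: int = Field(description="year of release")

parser = PydanticOutputParser(pydantic_object=Film)

prompt = PromptTemplate(
    template="Answer the question.\n{format_instructions}\n{question}",
    input_variables=["question"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

local_llm = LlamaCpp(model_path="/path/to/mistral-7b-v0.1.Q5_K_M.gguf", n_gpu_layers=64)

chain = prompt | local_llm | parser   # parse the raw completion into a Film object
film = chain.invoke({"question": "Name a classic science fiction film."})
print(film.title, film.year)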

Fine Tuning Local LLMs