
LLaVA is a large multimodal model (LMM) that builds on top of LLaMA, using a CLIP image encoder to add vision capabilities.

The original LLaVA 1.0 was largely trained on data generated by getting GPT-4 (before it had vision capability) to expand on short text descriptions of images. This has the obvious downside that GPT-4 was not working from the actual image content and may therefore have hallucinated details that are not in the image.

LLaVA 1.5 and 1.6

LLaVA 1.6 is trained on data with more in-depth annotations, including LAION GPT-4V (which, incidentally, I can’t find any formal evaluations of). This seems to improve the model’s ability to reason about the contents of images. It is also trained on TextVQA, a dataset that requires models to read and reason about text appearing within images in order to answer questions. It is likely that elements of both of these datasets help improve the model’s OCR ability.

Running LLaVA with llama.cpp

  1. Clone the llama.cpp source code from its git repository.
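
For example:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp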

  2. Create a build directory and cd into it: mkdir build; cd build

  3. Run cmake, passing in details of your CUDA installation if appropriate:

cmake .. -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc

  4. Build the llava-cli component.
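
For example, from the build directory (this assumes the build exposes a llava-cli target, which produces the ./bin/llava-cli binary used below):

cmake --build . --target llava-cli --config Release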

  5. Download the relevant models as documented in the llava README. I personally found the best models to be the Mistral 7B-based ones used in the example below.
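
As an illustration only (the repository and filenames below are placeholders; substitute the links given in the llava README):

# Placeholder repo/filenames: check the llava README for the current links
huggingface-cli download mys/ggml_bakllava-1 \
  ggml-model-q5_k.gguf mmproj-model-f16.gguf \
  --local-dir ./models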

  6. Run the model. Note that you must set the context size manually with -c 4096 or you will get an error message.

./bin/llava-cli -m ./path/to/models/ggml-mistral-7b-q_5_k.gguf \
  --mmproj ../models/mmproj-mistral7b-f16.gguf \
  --image /path/to/test2.jpeg \
  -p "please carefully transcribe these handwritten notes without hallucinating." \
  -ngl 30 \
  -ngld 30 \
  -c 4096