Large Multimodal Models (LMMs) are essentially LLMs that also accept other modalities of input, such as images or audio, alongside text.
The most common class of LMM is the Vision-Language Model (VLM). This class of models is growing quickly, and the most widely used is probably GPT-4. However, there are a number of other LMM offerings, both closed and open source, including LLaVA and Qwen-VL.
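As a rough sketch of how a VLM is used in practice, the snippet below passes an image plus a text prompt to an open model (LLaVA) through the Hugging Face `transformers` library. The checkpoint name, image URL, and prompt template are assumptions here; check the model card for the exact chat format.

```python
# Minimal sketch: image + text inference with an open VLM (LLaVA) via transformers.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Load an example image (placeholder URL).
image = Image.open(requests.get("https://example.com/cat.png", stream=True).raw)

# LLaVA 1.5 expects an <image> placeholder token in the prompt.
prompt = "USER: <image>\nWhat is shown in this picture? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

Closed models such as GPT-4 expose the same pattern through an API call: the request carries both the text prompt and the image, and the response is ordinary text.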
Earlier multimodal models such as LayoutLM did not use CLIP-style image encoders.
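For contrast, the sketch below shows what a CLIP-style image encoder provides: it maps an image into an embedding space shared with text, which is the building block many current VLMs start from. The checkpoint name and image URL are assumptions.

```python
# Minimal sketch: encoding an image into a CLIP embedding with transformers.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"  # assumed checkpoint name
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

image = Image.open(requests.get("https://example.com/cat.png", stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    image_embedding = model.get_image_features(**inputs)  # shape (1, 512)

print(image_embedding.shape)
```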
See more models in List of LLM and VLM models and pages tagged with LMM and multimodal.