Large Multimodal Models (LMMs) are essentially LLMs that also accept other modalities of input, such as images or audio, alongside text.
The most common class of LMM is the Vision-Language Model (VLM). This class of models is growing quickly, and the most widely used is probably GPT-4. However, there are a number of other LMM offerings, both closed and open source, including LLaVA and Qwen-VL.
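As a rough sketch of how a VLM is used in practice, the snippet below passes an image plus a text prompt to an open model (LLaVA) through the Hugging Face `transformers` library. The checkpoint name, image URL, and prompt template are assumptions here; check the model card for the exact chat format.

```python
# Minimal sketch: image + text inference with an open VLM (LLaVA) via transformers.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Load an example image (placeholder URL).
image = Image.open(requests.get("https://example.com/cat.png", stream=True).raw)

# LLaVA 1.5 expects an <image> placeholder token in the prompt.
prompt = "USER: <image>\nWhat is shown in this picture? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

Closed models such as GPT-4 expose the same pattern through an API call: the request carries both the text prompt and the image, and the response is ordinary text.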
Earlier multimodal models such as LayoutLM did not use CLIP-style image encoders.
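For contrast, the sketch below shows what a CLIP-style image encoder provides: it maps an image into an embedding space shared with text, which is the building block many current VLMs start from. The checkpoint name and image URL are assumptions.

```python
# Minimal sketch: encoding an image into a CLIP embedding with transformers.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

checkpoint = "openai/clip-vit-base-patch32"  # assumed checkpoint name
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)

image = Image.open(requests.get("https://example.com/cat.png", stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    image_embedding = model.get_image_features(**inputs)  # shape (1, 512)

print(image_embedding.shape)
```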
See more models in List of LLM and VLM models and pages tagged with LMM and multimodal.